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Preface 



About This Manual 

This manual provides information on implementing FORTRAN 77 programs 
using IRIX™ and the IRIS-4D™ Series workstation. This implementation of 
FORTRAN 77 contains full American National Standard (ANSI) Programming 
Language FORTRAN (X3.9— 1978). It has extensions that provide full VMS 
FORTRAN compatibility to the extent possible without the VMS operating 
system or VAX data representation. It also contains extensions that provide 
partial compatibility with programs written in S VS FORTRAN and 
FORTRAN 66. 

FORTRAN 77 is referred to as FORTRAN throughout this manual except where 
distinctions between FORTRAN 77 and FORTRAN 66 are being specifically 
discussed. 



Corequisite Publications 



Refer to the FORTRAN 77 Language Reference Manual for a description of the 
FORTRAN Language as implemented by the Silicon Graphics IRIS-4D Series 
workstation. 

Refer to the IRIS-4D Series Compiler Guide for information on the following 
topics: 

• An overview of the compiler system. 

Improving program performance and using the profiling and optimization 
facilities of the compiler system. 

The dump utilities, archiver, and other tools for maintaining FORTRAN 
programs. 

-vii- 



Refer to the dbx Reference Manual for a detailed description of the debugger. 

For information on interfaces to programs written in assembly language, refer to 
the Assembly Language Programmer's Guide. 



Appendix A: Runtime Error Messages, Lists the error messages that can 
be generated during program execution. 



( 



Topics Covered 

Chapter 1: Compiling, Linking, and Running Programs, Gives an 
overview of components of the compiler system, describes how to compile, 
link edit, and execute a FORTRAN program. Also describes special 
considerations for programs running on IRIX systems, such as file format 
and error handling. 

Chapter 2: Storage Mapping. Describes how the FORTRAN compiler 
implements size and value ranges for various data types, and how they are 
mapped to storage. Also, describes how to access misaligned data. 

Chapter 3: FORTRAN Program Interfaces. Provides reference and 
guide information on writing programs in FORTRAN, C, and Pascal that 
can communicate with each other. Also, describes the process of generating 
wrappers for C routines called by FORTRAN. | 

Chapter 4: System Functions and Subroutines. Describes functions and 
subroutines that can be used with a program to communicate with the IRIX 
operating system. 

Chapter 5: FORTRAN Enhancements for Multiprocessors. Describes 
programming directives for running FORTRAN programs in a 
multiprocessor mode. 

Chapter 6: Compiling and Debugging Parallel FORTRAN. Describes 
and illustrates compilation and debugging techniques for running 
FORTRAN programs in a multiprocessor mode. 
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1. Compiling, Linking, and Running 
Programs 

This chapter contains the following major sections: 

Compiling and Linking, which describes the compilation environment and 
how to compile and link FORTRAN programs. Examples are included to 
show how to create separate linkable objects written in FORTRAN, C, 
Pascal, or other languages supported by the compiler system, and to link 
them into an executable object program. 

Driver Options, which gives an overview of debugging, profiling, 
optimizing, and other options provided with the FORTRAN/77 driver. 

Object File Tools, which briefly summarizes the capabilities of the odump, 
run, file and size programs that provide listing and other information on 
object files. 

Archiver, which summarizes the functions of the ar program that maintains 
archive libraries. 

Runtime Considerations, which describes how to invoke a FORTRAN 
program, how the operating system treats files, and runtime error handling. 

You should refer to the FORTRAN Release Notes that accompanied your 
FORTRAN installation package. This document sometimes lists compiler 
enhancements, and possible compiler errors, and how to circumvent them. 
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1.1 Compiling and Linking 



1.1.1 Drivers 



( 



Intelligent programs called drivers actually invoke the major components of the 
compiler system: the FORTRAN compiler, the intermediate code optimizer, the 
code generator, the assembler, and the link editor. The/77 command runs the 
driver that causes your programs to be compiled, optimized, assembled, and link 
edited. 

The format of the /77 driver command is as follows: 

f77 [option] ... filename, f [option] 

where, 



f77 invokes the various processing phases that compile, optimize, 

assemble, and link edit the program. 
option represents the driver options, through which you provide 

instructions to the processing phases. They can be anywhere 

in the command line. These options are discussed later in the 

chapter. 
filename./ is the name of the file that contains the FORTRAN source 

statements. The file name must always contain the suffix /, 

for example, myprog.f. 



( 
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1.1.2 Compilation 

The driver commandf77 can both compile and link edit a source module. The 
next figure shows how the primary drivers phases, and their principal inputs and 
outputs for the source modules more.f. 



more.f 










FORTRAN Front End 














Optimizer 



(optbnal) 



Code Generator 



Assembler 
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more.o 


Link Editor 
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a. out 







Figure 1-1. The Compilation Process 

Note the following: 

The source file ends with the required suffix ./. 

The source file is passed through the C preprocessor, cpp, by default. The 
-nocpp option prevents this. In the example 

f77 myprog.f -nocpp 

the file myprog.f will not be preprocessed by cpp. 

The driver produces a linkable object file when you specify the -c driver 
option. This file has the same name as the source file, except with the suffix 
.o. For example, the following command line: 

f77 more.f -c 

produces the more.o file in the above example. 
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The default name of the executable object file is a.out. For example, the 
command line 

f77 myprog.f 

produces the executable object a.out 

You can specify a name other than a.out for the executable object by using 
the driver option -o name, where name is the name of the executable object. 
For example, the following command line: 

f77 more.o -0 exec.myprog 

link edits the object module more.o and produces an executable object 
named exec.myprog. 

The following command line: 

f77 myprog.f -0 exec.myprog 

compiles and link edits the source module myprog.f and produces an 
executable object named exec.myprog. 

To compile and link a program with the IRIS graphics library use the -Zg 
compiler option. For example, the following command line: 

f77 myprog.f -Zg 
compiles and links the graphics program myprog.f 'and produces a.out. 



( 
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1.1.3 Compiling Multi-Language Programs 

The compiler system provides drivers for other languages, including C 9 Pascal, 
COBOL, and PL/1. If one of these drivers is installed in your system, you can 
compile and link your FORTRAN programs to the language supported by the 
driver. (See the IRIX Series Compiler Guide for a list of available •drivers and 
the commands that invoke them; refer to Chapter 3 of this manual for 
conventions you must follow in •writing FORTRAN program interfaces to C and 
Pascal program.) 

When your application has two or more source programs written in different 
languages, you should compile each program module separately with the 
appropriate driver and then link them in a separate step. You can create objects 
suitable for link editing by specifying the ~c option, which stops the driver 
immediately after the assembler phase. For example: 

cc -c main.c 
£77 -c rest.f 

The two command lines shown above produce linkable objects named main.o 
and resto, as illustrated in the next figure. 
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Figure 1-2. Compiling Multi-Language Programs 

1 ,1 .4 Linking Objects 

You can also use the/77 driver command to link edit separate objects into one 
executable program when the main program is written in FORTRAN. The 
driver recognizes the .o suffix as the name of a file containing object code 
suitable for link editing and immediately invokes the link editor. The following 
command link edits the object created in the last example: 

£77 -0 all main.o rest.o 

This statement produces the executable program object alh If the main program 
is not written in FORTRAN, you should use the applicable driver. For example, 
if the main program were written in C, you would use the cc driver command, as 
shown below: 



c 



( 



( 



cc -o all main.o rest.o -1F77 -1U77 -1177 -lisam -1m 
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The figure below shows the flow of control for this link edit. 



main.o rest.o 



Link Editor 



i 




|FORTRAN| 



all 



Figure 1-3. Link Editing 



Both pi and cc use the C link library by default. However, the cc driver 
command does not know the names of the link libraries required by the 
FORTRAN objects; therefore, you must specify them explicitly to the link editor 
using the -1 option as shown in the example. The characters following -1 are 
shorthand for link library files as shown in this table: 



-I 



Link Library 



Contents 



F77 


lusrllibllibF77.a 


177 


lusrllibllibI77.a 


I77_mp 


lusrllibllibI77_mp.a 


U77 


lusrllibllibU77.a 


isam 


lusrllibllibisam.a 


fgl 


lusrllibllibfglla 


m 


lusrllibllibm.a 



FORTRAN intrinsic function library 

FORTRAN I/O library 

FORTRAN multiprocessing I/O library 

FORTRAN IRIX interface library 

Indexed sequential access method library 

FORTRAN graphics library 

Math library 



Table 1-1. Link Libraries 

See the FILES section in/77(l) of the IRIX User's Reference Manual for a 
complete list of the files used by the FORTRAN driver. See W(l) in the same 
manual for information on specifying the -1 option. 
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1.1.5 Specifying Link Libraries 

You must explicitly load any required runtime libraries when compiling multi- 
language programs. For example, when you link a program written in 
FORTRAN and some procedures written in PASCAL, you must explicitly load 
the PASCAL library libp.a and the math library lihm.a with the options -Ip and 
-Im (abbreviations for the libraries libp.a and libm.a). This is demonstrated in 
the next example. 

% f77 main.o more.o rest.o -lp -1m 

To find the Pascal library, the link editor replaces the -1 with lib and adds an .a 
after p. Then, it searches the /lib, lusrllih, and lusrllocalllib directories for this 
library. For a list of the libraries that a language uses, see the associated driver 
manual page cc(l),/77(l), or pc(l) in the IRIX User' s Reference Manual, 
FORTRAN 77 Language Reference Manual Pages, or Pascal Reference Manual 
Pages. 

You may need to specify libraries when you use IRIX system packages that are 
not part of a particular language. Most of the manual pages for these packages 
list the required libraries. For example, the getwd(3B) subroutine requires the 
BSD compatibility library libbsd.a. This library is specified as follows: 

% f77 main.o more.o rest.o -Ibsd 

To specify a library created with the archiver, type in the pathname of the library 
as shown below. 

% 111 main.o more.o rest.o libfft.a 

Note: The link editor searches libraries in the order you specify. Therefore, if 
you have a library (for example, libfft.a) that uses data or procedures 
from -lp, you MUST specify libfft.a first. 



1.2 Driver Options 

This section contains a summary of the FORTRAN-specific driver options. See 
the/77(l) in the FORTRAN 77 Reference Manual Pages for a complete 
description of the compiler options; see W(l) in the same manual for a 
description of the link editor options. 
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Option Name 



Purpose 



-456 



-align8 

-alignl6 

-align32 



-backslash 

-C 

-check bounds 



Permits compilation of FORTRAN 66 source programs. 
When used at compile time, the following three options 
generate various degrees of misaligned data in common 
blocks, and the code to deal with the misalignment. 

Note: When specified, these options can degrade 

program performance; -align8 causes the greatest 
degree of degradation and -align32 causes the 
least. 

Permits objects larger than 8 bits to be aligned on 8-bit 
boundaries. Using this option will have the largest impact 
on performance. 

Permits objects larger than 16 bits to be aligned on 16-bit 
boundaries; 16-bit objects must still be aligned on 16-bit 
boundaries (MC68000-like alignment rules). 

Permits objects larger than 32 bits to be aligned on 32-bit 
boundaries; 16-bit objects must still be aligned on 16-bit 
boundaries , and 32-bit objects must still be aligned on 
32-bit boundaries. This is the default alignment. 

You must specify the appropriate alignment option in the 
compilation of all modules that reference or define 
common blocks with misaligned data. Failure to do so 
could cause core dumps (if the trap handler is not used), 
or mismatched common blocks. 

To load the system libraries capable of handling 
misaligned data, use the -Llusrlliblalign switch at load 
time. The trap handler may be needed to handle 
misaligned data passed to system libraries which are not 
included in the lusrllihlalign directory ($eefixade(3f) and 
unalign(3x)). 

Allows the backslash character to be used as a normal 
FORTRAN character instead of the beginning of an 
escape sequence. 

Generate code for runtime subscript range checking. The 
default suppresses range checking. 

Causes an error message to be issued at runtime when the 
value of an array subscript expression exceeds the bounds 
declared for the array. 
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Option Name 



Purpose 



-<hunk=integer 



-col72 



Has the same effect as putting a c$CHUNK=i/iteger 
directive at the beginning of the file. See chapters 5 and 6 
of this manual for more details. 

Sets the source statement format to the following: 



( 



-C01120 



Column 


Contents 


1-5 


Statement label 


6 


Continuation indicator 


7-72 


Statement body 


73-end 


Ignored 



If the source statement contains fewer than 72 characters, 
no blank padding occurs; the TAB-format facility is 
disabled. 

This option provides the SYS FORTRAN 72-column 
option mode. 

Sets the source statement format to the following: 



Column 


Contents 


1-5 


Statement label 


6 


Continuation indicator 


7-120 


Statement body 


121-end 


Ignored 



( 



-cpp 



-d lines 



If the source statement contains fewer than 120 
characters, no blank padding occurs; the TAB-format 
facility is disabled. 

This option provides the S VS FORTRAN default mode. 

Runs the C macro preprocessor on all source files, 
including those created by RATFOR, before compilation. 
(This is the default option.) 

Causes any lines with a D in column 1 to be compiled. 
By default, the compiler treats all lines with a character in 
column 1 as comment lines. 



( 
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Option Name 


Purpose 




-extend_source 


Sets the source 


statement format to the following: 




Column 


Contents 




1-5 
6 

7-132 
133-end 


Statement label 
Continuation indicator 
Statement body 
Warning message issued 



-F 

-i2 



-m 



-mp 



If the source statement contains fewer than 132 
characters, blanks are assumed at the end; the ability of 
TAB-formatted lines to extend past column 132 is 
disabled. 

This option provides VMS FORTRAN 132-column mode, 
except that a warning, instead of a fatal error message, is 
generated when text extends beyond column 132. 

Calls the RATFOR preprocessor only, and puts the output 
in a /file. Doesn't produce .o files. 

All small integer constants become INTEGER*2. All 
variables and functions implicitly or explicitly declared 
type INTEGER or LOGICAL (without a size designator, 
i.e., *2, *4, etc.) will be INTEGER*2 or LOGICAL*2 
respectively. 

If the generic function results don't determine the 
precision of an integer- valued intrinsic function, the 
compiler chooses the precisions that return INTEGER*2. 
The default is INTEGER*4. Note that INTEGER*2 and 
LOGICAL*2 quantities don't obey the FORTRAN 
standard rules for storage location. 

Applies the M4 macro preprocessor to source files to be 
transformed with RATFOR. The driver puts the result in 
a .p file. Unless you specify the -K option, the compiler 
removes the .p file on completion. See the m4{\) 
description in the IRIX User's Reference Manual for 
details. 

Enable the multiprocessing directives. See chapters 5 and 
6 of this book, and the man page on/77(l) for further 
options affecting multiprocessing compilation. 
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Option Name 



Purpose 



-mp_schedtype=type 



-N[qxscnl]jfwii 



-nocpp 
-noextend source 



~noi4 
-nof77 



-old rl 



Has the same effect as putting a c$mp__schedtype= type 
directive at the beginning of the file. The supported types 
are simple, interleave, dynamic, gss, and runtime. See 
chapters 5 and 6 of this manual for more details. 

nnn is a decimal number changing the default size of the 
static tables in the compiler. See the/77(l) description in 
the FORTRAN 77 Reference Manual Pages for details. 

Does not run the C preprocessor on the source files. 
Sets the source statement format to the following: 



Column 


Contents 


1-5 


Statement Label 


6 


Continuation indicator 


7-72 


Statement Pody 


73-end 


Ignored 



If the source statement contains fewer than 72 characters, 
blanks are assumed at the end; the ability of TAB- 
formatted lines to extend past column 72 is disabled. 

This option provides VMS FORTRAN default mode. 

Same as -i2 option. 

Same as -onetrip switch except for the following: 

1 . EXTERNAL statements have an altered syntax and 
functionality. 

2. The default value for the BLANK= clause in an 
OPEN statement is 'ZERO'. 

3. The default value for the STATUS= clause in an 
OPEN statement is 'NEW. 

Interprets the record length specifier for a direct 
unformatted file as a number of bytes instead of a number 
of words. This option provides backward compatibility 
with 4D1-3.1 releases and earlier. 



( 



( 



( 



1-12 FORTRAN 77 Programmer's Guide 



IRIS-4D Series 



Option Name 

-onetrip 
-1 



-pfa 



-Rflags 



-static 



-U 



-u 



Purpose 

Compiles DO loops so that they execute at least once if 
reached. By default, DO loops are not executed if the 
upper limit is smaller than the lower limit. Similar to the 
-nof77 option. 

Run the/?/a(l) preprocessor to discover automatically 
parallelism in the source code. This also enables the 
multiprocessing directives. There are two optional 
arguments: -pfa list will run pfa, and also produce a 
listing file with suffix ./ explaining which loops were 
parallelized, and if not, why not. -pfa keep runs pfa, 
produces the listing file, and also keeps the transformed 
multiprocessed FORTRAN intermediate file in a file with 
suffix .a. 

flags is a valid option for the RATFOR preprocessor; the 
flags are given in the ratfor(l) page in the FORTRAN 77 
Reference Manual Pages. 

The RATFOR input file name is filenames. The resulting 
output is placed in filename f. You must specify the -K 
option to retain the output file. 

Local variables are saved in one static location; 
subsequent calls to the procedure containing the variables 
can change their values. This overrides the default 
-automatic option. 

Causes the compiler to differentiate upper- and lower-case 
alphabetic characters. For example, the compiler 
considers a and A as distinct characters. Note that this 
option causes the compiler to recognize lowercase 
keywords only. Therefore, lowercase keywords must be 
used in writing case-sensitive programs (or in writing 
generic header files). 

Turns off FORTRAN default data typing and any data 
typing explicitly specified in an IMPLICIT statement. 
Forces the explicit declaration of all data types. 
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Option Name Purpose 



-vmscc Uses VMS FORTRAN carriage control interpretation on 

unit 6. 

-vms_endfile Causes a VMS endfile record to be written when an \ 

ENDFILE statement is executed and allows subsequent 
reading from an input file after an endfile record is 
encountered. 

-vms jstdin Allows rereading from stdin after EOF has been 

encountered. 

-w66 Suppresses FORTRAN 66 compatibility warning 

messages. , 

Table 1-2. Compiler Flags 

1.2.1 Debugging 

The compiler system provides a source level, interactive debugger called dbx 

that you can use to debug programs as they execute. With dbx you can control 

program execution to set breakpoints, monitor what is happening, modify f 

values, and evaluate results, dbx keeps track of variables, subprograms, \ 

subroutines, and data types in terms of the symbols used in the source language. 

You can use this debugger to access the source text of the program, to identify 

and reference program entities, and to detect errors in the logic of the program. 

Reference Information 

For a complete list of -g driver options, see/77(l) manual page in the 
FORTRAN 77 Reference Manual Pages; see dbx(l) in the IRIX User's Reference 
Manual for information on the debugger. For a complete description, see the 
dbx Reference Manual. 



( 
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1.2.2 Profiling 

The compiler system permits the generation of profiled programs that, when 
executed, provide operational statistics. This is done through driver option -p 
(which provides pc sampling information) and the pixie and prof programs. 

A variety of options and methods of profiling are available; if you wish to learn 
more about them, read Chapter 2 of the IRIX Series Compiler Guide, which 
describes the advantages and methods of profiling, and gives examples of the 
various options and commands to achieve the desired results. Seeprofll) in the 
User' s Reference Manual for detailed reference information. 

1.2.3 Optimizing 

The default optimizing option (-Ol) causes the code generator and assembler 
phases of compilation to improve the performance of your executable object. 
You can prevent optimization by specifying -O0. 

The table below summarizes the optimizing functions available: 



Option Result _______ 

-03 Performs all optimizations, including global register allocation. 

With this option, a ucode object file is created for each FORTRAN 
source file and left in a .u file. The newly created ucode object 
files, the ucode object files specified on the command lines, the 
runtime startup routine, and all of the runtime libraries are ucode 
linked. Optimization is done globally on the resulting ucode linked 
file and then it is linked as normal, producing an a.out file. No .o 
is left from the ucode linked result, -c cannot be specified with 
-03. 

-02 The global optimizer (uopt) phase executes. It performs 

optimization only within the bounds of individual compilation 
units. 

-Ol Default option. The code generator and the assembler perform 

basic optimizations in a more limited scope. 

-O0 No optimization. 

Table 1-3. Optimizer Flags 
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The default option (-Ol) causes the code generator and the assembler to 
perform basic optimizations such as constant folding, common subexpression 

elimination within individual statements, and common subexpression 

elimination between statements. 

The global optimizer— invoked by the -02 option— is a single program that 
improves the performance of an dbject program by transforming existing code 
into more efficient coding sequences. Although the same optimizer processes all 
compter optimizations, it does distinguish between the various languages 
supported by the compiler system programs to take advantage of the different 
language semantics involved 

See the IRIX Series Compiler Guide for details on the optimization techniques 

used by the compiler and tips on writing optimal code for optimizer processing. 

1.2.4 Performance 

In addition to optimizing options, the compiler system provides other options 

that can improve the performance of your programs: 

• The -feedback and -cord options (see/77(l) in the User 9 s Reference 
Manual) together with the pixk(l) and profit) utilities can be used to 
reduce possible machine cache conflicts. See deducing Cache Conflicts" 
in Chapter 4 for an example using these facilities. 

The link editor -G man and -bestG num options offer means of controlling 
the size of the global data area, which can produce significant performance 
improvements. See Chapter 2 of the IRIX Series Compiler Guide and ld{\) 
in the Usefs Reference Manual for more information. 

• The -jmpopf option permits the link editor to fill certain instruction delay 
slots not filled by the compiler front end. This option may improve the 
performance of smaller programs not requiring extremely large blocks of 
virtual memory. See ld(l) for more information. 



( 
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1.3 Object File Tools 

The following tools provide information on object files as indicated: 

odwnp: The odump tool lists headers, tables, and other selected parts of an 

object or archive file. An explanation of the information provided by 
odump can be found in Chapters 10 and 1 1 of the Assembly 
Language Programmer 1 s Guide. 

stdump: The stdump tool lists intermediate-code symbolic information for 
object files, executables, or symbolic information files. 

nm: The nm tool prints symbol table information for object files and 

archive files. 

file: The file tool lists the properties of program source, text, object, and 

other files. This tool often erroneously recognizes command files as 
C programs. It does not recognize Pascal or LISP programs. 

size: The size tool prints information about the text, rdata, data, sdata, bss, 

and sbss sections of the specified object or archive file(s). The 
contents and format of section data are described in Chapter 10 of the 
Assembly Language Programmer's Guide. 

For more information on these tools, see odump(l), stdump(l), nm(l),file(l), or 
size(l) in the User's Reference Manual. 
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1.4 Archiver 

An archive library is a file that contains one or more routines in object (.o) file V 

format. The term object as used in this chapter refers to an .o file that is part of 

an archive library file. When a program calls an object not explicitly included in 

the program, the link editor Id) looks for that object in an archive library. The 

editor then loads only that object (not the whole library) and links it with the 

calling program. 

The archiver (ar) creates and maintains archive libraries and has the following 
main functions: 

Copying new objects into the library 

• Replacing existing objects in the library 
Moving objects about the library 

• Copying individual objects from the library into individual object files 

See ar(l) in the User's Reference Manual for additional information on the V 

archiver. 



1.5 Runtime Considerations 
1 .5.1 Invoking a Program 

To run a FORTRAN program, invoke the executable object module produced by 
the/77 command by entering the name of the module as a command. By 
default, the name of the executable module is a.out. If you included the -o 
filename option on the Id (or/77) command line, the executable object module 
has the name that you specified. 



( 
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1.5-2 File Formats 

FORTRAN supports five kinds of external files: 

sequential format 

sequential unformatted 

direct formatted 

direct unformatted 

key indexed file 

The operating system implements other files as ordinary files and makes no 
assumptions about their internal structure. 

FORTRAN I/O is based on records. When a program opens a direct file or key 
indexed file, the length of the records must be given. The FORTRAN I/O 
system uses the length to make the file appear to be made up of records of the 
given length. When the record length of a direct file is 1, the system treats the 
file as ordinary system files (as byte strings, in which each byte is addressable). 
A read or write request on such files consumes bytes until satisfied, rather than 
restricting itself to a single record. 

Because of special requirements, sequential unformatted files will probably be 
read or written only by FORTRAN I/O statements. Each record is preceded and 
followed by an integer containing the length of the record in bytes. 

During a READ, FORTRAN I/O breaks sequential unformatted files into 
records by using each newline indicator as a record separator. The FORTRAN 
77 American National Standard does not define the required result after reading 
past the end of a record; the I/O system treats the record as being extended by 
blanks. On output, the I/O system writes a newline indicator at the end of each 
record. If a user program also writes a newline indicator, the I/O system treats it 
as a separate record. 



Version 2.0 Compiling, Linking, and Running Programs 1-19 



( 



1.5.3 Preconnected Files 

The following table show preconnected files at program start. 

Unit # Unit 

5 Standard Input 

6 Standard Output 
Standard Error 

Table 1-4, Preconnected Files 

All other units are also preconnected when execution begins. Unit n is 
connected to a file named fort.w. These files need not exist, nor will they be 
created unless their units are used without first executing an open. The default 
connection is for sequentially formatted I/O. 

1.5.4 File Positions 

The FORTRAN 77 standard does not specify where open should initially | 

position a file explicitly opened for sequential I/O. The I/O system positions the V. 

file to start-of-file, both for input and output. The execution of an open 
statement followed by a write on an existing file causes the file to be 
overwritten, erasing any data in the file. In a program called from a parent 
process, Units 0, 5, and 6 are positioned by the parent process. 

1 .5.5 Unknown File Status 

When the parameter status=unknown is specified in an open statement, the 
following occurs: 

If the file doesn't already exist, it is created and positioned at start-of-file. 

If the file already exists, it is opened and any data in the file is truncated. 



( 
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1 .5.6 Runtime Error Handling 

When the FORTRAN runtime system detects an error, the following action takes 
place: 

A message describing the error is written to the standard error unit (Unit 0). 
See Appendix A for a list of the error messages. 

• A core file is produced, which can be used with dbx or edge to inspect the 
state of the program at termination. For more information, see the dbx 
Reference Manual and edge(l). 

To invoke dbx using the core file, enter the following: 

% dbx binary- file core 

where binary-file is the name of the object file output (the default is a.out). For 
more information on dbx, see "Debugging" in Section 1.2.1 of this manual. 

1.5.7 Trap Handling 

The library libfpe.a provides two methods for handling floating point 
exceptions: the subroutine handle _sigfpes(, and the environment variable 
trap_fpe. Both methods provide mechanisms for handling and classifying 
floating point exceptions, and for substituting new values. They also provide 
mechanisms to count, trace, exit or abort on enabled exceptions. See the man 
page on handle _sigf pes for more information. 
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2. Storage Mapping 



This chapter contains two sections: 

Alignment, Size, and Value Ranges, which describes how the FORTRAN 
compiler implements size and value ranges for various data types as well as 
how data alignment occurs under normal conditions. 

Access of Misaligned Data, which describes two methods of accessing 
misaligned data. 
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2.1 Alignment, Size, and Value Ranges 



Table 2-1 contains information pertaining to various data types. 



( 



Type 


Synonym 


Size 


Alignment 


Value Range 


byte 


integer*1 


8 bits 


byte 


»128... 127 


integer*2 




16 bits 


halfword 1 


»32,768 ... 32,767 


integer 


integerM 4 


32 bits 


word 2 


-2 31 ... 2 31 -1 


logical* 1 




8 bits 


byte 


0... 1 


logical*2 




16 bits 


halfword 1 


0... 1 


logical 


logical*4 5 


32 bits 


word 2 


0... 1 


real 


realM 


32 bits 


word 2 


See Note 1 . 


dbl precision 


real*8 


64 bits 


doubleword 3 


See Note 1 . 


complex 


complex*8 


64 bits 


word 2 




dbl complex 


128 bits 


doubleword 3 




character 




8 bits 


byte 


-128... 127 



Byte boundary divisible by two. 

Byte boundary divisible by four. 
3 
Byte boundary divisible by eight. 

When -12 option is used, type "integer" would be equivalent to integer*2. 

When -i2 option is used, type "logical "would be equivalent to logical*2- 



( 



Table 2-1. Size, Alignment, and Value Ranges of Data Types 



See the following notes for more details on some of the items in the table. 



( 



2-2 FORTRAN 77 Programmer's Guide 



IRIS-4D Series 



1. Approximate valid ranges for real and double are: 
real double 

maximum 3.40282356 * 10 38 1.7976931348623158 * 10 308 

minimum normalized 1.17549424 * 10 _38 2.2250738585072012 * 10 " 308 
minimum denormalized 1 .401 29846 MO" 46 2.225073858507201 2*10 " 308 

Note: When the compiler encounters a real* 1 6 declaration, it issues a 
warning message, real* 1 6 items are allocated 16 bytes of storage per 
element, but only the first eight bytes of each element are used. Those eight 
bytes are interpreted according to the format for real* 8 floating numbers. 

When the compiler encounters a real* 1 6 constant in a source program, the 
compiler issues a warning message. The constant is treated as a double 
precision (real* 8) constant, real* 16 constants have the same form as 
double precision constants, except that the exponent indicator is Q instead 
ofD. 

2. Table 2-1 states that double precision variables always align on a 
doubleword boundary. However, FORTRAN permits these variables to 
align on a word boundary if a common statement or equivalencing requires 
this. 

3. Forcing integer, logical, real, and complex variables to align on a 
half word boundary is not allowed, except as permitted by the -align8, 
-alignl6, and -align32 command line options. See Chapter 1. 

4. A complex data item is an ordered pair of real numbers; a double-complex 
data item is an ordered pair of double-precision numbers. In each case, the 
first number represents the real part and the second represents the imaginary 
part. 

5. logical data items denote only the logical values TRUE and FALSE 
(written as . true . or . false.). However, to provide VMS compatibility, 
logical*! variables can be assigned all values in the range -128 to 127. 
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6. You must explicitly declare an array in a dimension declaration or in a data 
type declaration. The compiler follows these rules for dimension support: 

• Allows up to seven dimensions 

Assigns a default of 1 to the lower bound if a lower bound is not 
explicitly declared in the dimension statement 

Creates an array the size of its element type times the number of 
elements 

Stores arrays in column-major mode 

7. The following rules apply to shared blocks of data set up by the common 
statements: 

• The compiler assigns data items in the same sequence as they appear in 
the common statement(s) defining the block. Padding of data items 
will occur according to the alignment switches or the default compiler. 
See Section 2.2 for more information. 

You can allocate both character and non-character data in the same 
common block. 

When a common block appears in multiple program units, the compiler 
allocates the same size for that block in each unit, even though the size 
required may differ (due to varying element names, types and ordering 
sequences) from unit to unit. The size allocated corresponds to the 
maximum size required by the block among all the program units. 



( 



( 



( 



2-4 FORTRAN 77 Programmer's Guide 1RIS-4D Series 



2.2 Access of Misaligned Data 

The FORTRAN compiler allows misalignment of data if specified by the use of 
special options. 

As discussed in the previous section, the architecture of the IRIS-4D series 
assumes a particular alignment of data. ANSI standard FORTRAN 77 cannot 
violate the rules governing this alignment. Common extensions to the dialect, 
particularly small integer types, allowing intermixing of character and non- 
character data in common and equivalence statements, and mismatching the 
types of formal and actual parameters across a subroutine interface, provide 
many opportunities for misalignment to occur. 

Code using the extensions which compiled and executed correctly on other 
systems with less stringent alignment requirements may fail during compilation 
or execution on the IRIS-4D. This section describes a set of options to the 
FORTRAN compilation system that allows the compilation and execution of 
programs whose data may be misaligned. Be forewarned that the execution of 
programs that use these options will be significantly slower than the execution of 
a program with aligned data. 

The two methods described below can be used to create an executable object file 
that accesses misaligned data. 

2.2.1 Method One 

Use the first method if the number of instances of misaligned data access is very 
small, or to provide information on the occurrence of such accesses so that 
misalignment problems can be corrected at the source level. 

This method catches and corrects bus errors due to misaligned accesses. This 
ties the extent of program degradation to the frequency of these accesses. This 
method also includes capabilities for producing a report of these accesses, to 
enable their correction. 

To use this method, you must inhibit the FORTRAN front-end from padding 
data to force alignment. This may be done by compiling your program with one 
of two switches to/77(l). 
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Use the switch -Wf, -align8 if your program expects no restrictions on 
alignment. 

Use the switch -Wf, -alignl6 if your program expects to be run on a machine g- 

that requires half-word alignment. You must also use the misalignment trap | 

handler. This requires minor source code changes to initialize the handler, and - 
the addition of the handler binary to the link step ($eefixade(3f)). 

2.2.2 Method Two 

Use the second method for programs with widespread misalignment or whose 
source may not be modified. 

In this method, a set of special instructions is substituted by the IRIS-4D 
assembler for data accesses whose alignment cannot be guaranteed. The 
generation of these more forgiving instructions may be opted for each source 
file. 

This method is invoke by the specification of one of the alignment switches 

(-align8, -alignlfi) to/77, when compiling any source file that references 

misaligned data (see/77(l)). If your program passes misaligned data to system 

libraries, you may also need to link it with the trap handler (see fixade(3f)). f 



( 
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3. FORTRAN Program Interfaces 



This chapter contains the following major sections: 

FORTRAN/C Interface, which describes the interface between 
FORTRAN routines and routines written in C. It contains rules and gives 
examples for making calls and passing arguments between the two 
languages. 

FORTRAN/C Wrapper Interface, which describes the process of 
generating wrappers for C routines called by FORTRAN. 

FORTRAN/Pascal Interface, which describes the interface between 
FORTRAN routines and routines written in Pascal. It contains rules and 
gives examples for making calls and passing arguments between the two 
languages. 

You may need to refer to other sources of information as you read this chapter. 

• For information on storage mapping — how the variables of the various 
languages appear in storage — refer to Chapter 2 of this manual for 
FORTRAN, and to Chapter 2 in the appropriate language Programmer's 
Guide for other languages. 

For information on the standard linkage conventions used by the compiler in 
generating code, see Chapter 7 of the Assembly Language Programmer' s 
Guide. 

For information on built-in functions that provide access to non-FORTRAN 
system functions and library routines, see Chapter 4 of this manual. 
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3.1 FORTRAN/C Interface 

3.1.1 Procedure and Function Declarations 

This section discusses items you should consider before calling C functions from 
FORTRAN. 

Names 

When calling a FORTRAN subprogram from C, the C program must append an 
underscore (_) to the name of the FORTRAN subprogram. For example, if the 
name of the subprogram is matrix, then call it by the name matrix _. When 
FORTRAN is calling a C function, the name of the C function must also end 
with an underscore. 

Note that only one main routine is allowed per program. The main routine can 
be written in either C or FORTRAN. Below is an example of a C and a 
FORTRAN main routine. 

C FORTRAN "" 

main () { write (6,10) 

printf ("hi!\n") ; 10 format ('hi!') 
} end 

Table 3-1 „ Main Routines 

Invocations 

Invoke a FORTRAN subprogram as if it were an integer- valued function whose 
value specifies which alternate return to use. Alternate return arguments 
(statement labels) are not passed to the subprogram but cause an indexed branch 
in the calling subprogram. If the subprogram is not a function and has no entry 
points with alternate return arguments, the returned value is undefined. The 
FORTRAN statement 

call nret (*1,*2,*3) 

is treated exactly as if it were the computed goto 

goto (1,2,3), nret() 
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A C function that calls a FORTRAN subprogram can usually ignore the return 
value of a FORTRAN subroutine; however, the C function should not ignore 
the return value of a FORTRAN function. The table below shows equivalent 
function and subprogram declarations in C and FORTRAN programs. 

C Function Declaration FORTRAN Function Declaration 

double dfort() double precision function dfort () 

double rfort() real function rfort() 

int ifort () integer function ifort() 

int Ifort logical function lfort () 

Table 3-2. Function Declarations 

Note the following: 

Avoid calling FORTRAN functions of type float, complex, and 

CHARACTER from C. 

You cannot write a C function so that it will return a complex value to 
FORTRAN. 

A character- valued FORTRAN subprogram is equivalent to a C language 
routine with two extra initial arguments: a data address and a length. 
However, if the length is one, no extra argument is needed and the single 
character result is returned as in a normal numeric function. 

Thus: 

character*15 function g(...) 

is equivalent to: 

char result [1] ; 
long int length; 
g_ (result, length, ...) 



and could be invoked in C by: 

char chars [15] 
g ( chars, 15, ...) ; 
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And: 

character function h(...) 

could be invoked in C by: 

char c, h () ; 
c-h (...); 



( 



C 



3.1.2 Arguments 

The following rules apply to arguments passed between FORTRAN and C: 

1 . All explicit arguments must be passed by reference. All routines must 
specify an address rather than a value. Thus, to pass constants or 
expressions to FORTRAN, the C routine must first store their values into 
variables and then pass the address of the variable. (The only exception 
occurs when passing the length of a string from C to a FORTRAN 
subroutine with a parameter of type character.) 

2. When passing the address of a variable, the data representations of the 
variable in the calling and called routines must correspond, as shown in 
Table 3-1. 

______ _ ^_ 

integer* 2 x short int x; 

integer x long int x; or just int x; 

logical x long int x; or just int x; 

real x float x; 

double precision x double x; 

complex x struct {float real, imag; ) x; 

double complex x struct {double dreal r dimag; } x; 

character* 6 x char x[6] ^ 

The length of the array must also be passed., as discussed in the next section. 

Table 3-1 . Equivalent FORTRAN and C Data Types | 

Note that in FORTRAN, integer and logical variables occupy 32 bits of 
memory by default, but this can be changed by using the -12 option. 
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3. The FORTRAN compiler may add items not explicitly specified in the 
source code to the argument list The compiler adds the following items 
under the conditions specified: 

• Destination address for character functions, when called. 

Length of a character variable, when an argument is the address of a 
character variable. 

When a C function calls a FORTRAN routine, the C function must explicitly 
specify these items in its argument list in the following order; 

1. If the FORTRAN routine is a function that returns a character variable of 
length greater than 1, specify the address and length of the resultant 
character variable. 

2. Normal arguments (addresses of arguments or functions). 

3. The length of each normal character parameter in the order it appeared in 
the argument list. The length must be specified as a constant value or 
integer variable (i.e., not an address). 

The examples on the following pages illustrate these rules. 

Example 1 : The following example shows how a C routine must specify the 
length of a character string (which is only implied in a FORTRAN call). 

"C~" FORTRAN CALL TO SAM, A ROUTINE WRITTEN 
C IN FORTRAN 

EXTERNAL F 
CHARACTER*? S 
INTEGER B(3) 

CALL SAM (F, B(2) , S) 



\* C CALL TO SAM *\ 
INT F(); 
CHAR S [ 7 ] ; 
LONG INT B[3] ; 

SAM (F, &B[1], S, 7); 
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Example 2: The following example shows how a C routine can specify the 
destination address of a FORTRAN function (which is only implied in a 
FORTRAN program). 

"c FORTRAN CALL TO F, A FUNCTION WRITTEN — 

C IN FORTRAN 

EXTERNAL F 
CHARACTER*? F, G 
G - F() 



\* C CALL TO F *\ 

CHAR S[10J; 
F (&S, 10); 



FUNCTION F, WRITTEN IN FORTRAN 

CHARACTER* 10 FUNCTION F() 

F -. ^0123456789' 

RETURN 

END 



3.1.3 Array Handling 

FORTRAN stores arrays in column-major order with the leftmost subscript 
varying the fastest. C, however, stores arrays in the opposite arrangement, with 
the rightmost subscripts varying the fastest, which is called row-major order. 
Here's how the layout of the FORTRAN arrays and C arrays looks: 

FORTRAN _____^^ 

integer t (2,3) 

t(l,l), t(2,l), t(l,2), t(2,2) f t(l,3), t(2,3) 



int t [2] [3] 

t[0][0], t[0][l], t[0][2], t[l][0], t[l][0], t[l][l], t[l][2] 

Note that the default for the lower bound of an array in FORTRAN is 1, where 
the default in C is 0. 



( 



( 



( 
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When a C routine uses an array passed by a FORTRAN subprogram, the 
dimensions of the array and the use of the subscripts must be interchanged, as 
shown in the following example. 



FORTRAN caller: 



C called routine: 




A. Dimensions and subscripts are reversed. 



B. 1 is subtracted from the indices, 
j and i are pointers to integers. 

Figure 3-1. Array Subscripts 

The FORTRAN caller prints out the value 99. Note the following: 

A. Because arrays are stored in column-major order in FORTRAN and row- 
major order in C, the dimension and subscript specifications are reversed. 

B. Because the lower-bound default is 1 for FORTRAN and for C, 1 must be 
subtracted from the indices in the C routine. Also, because FORTRAN 
passes parameters by reference, the *j and *p are pointers used in the C 
routine. 
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3.1.4 Accessing Common Blocks of Data 

The following rules apply to accessing common blocks of data: 

• FORTRAN common blocks must be declared by common statements; C 
can use any global variable. Note that the common block name in C (samj 
must end with an underscore. 

Data types in FORTRAN and C programs must match unless you desire 
equivalencing. If so, you must adhere to the alignment restrictions for the 
data types described in Chapter 2. 

If the same common block is of unequal length, the largest of the sizes is 
used to allocate space. 

Unnamed common blocks are given the name J3LNK_ 

The following gives examples of C and FORTRAN routines that access 
common blocks of data. 

FORTRAN 

subroutine sam() 
common /r/ i, r 
i - 786 
r - 3.2 
return 



struct S {int i; float j;}}r_; 

main () { 

sam () ; 

print£P%d %f \n", r_.i, r_. j) ; 

} 

The C routine prints out 786 and 3.2. 



( 



( 



( 
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When a C routine uses an array passed by a FORTRAN subprogram, the 
dimensions of the array and the use of the subscripts must be interchanged, as 
shown in the following example. 



FORTRAN caller: 



C called routine: 




= 99; 



A, Dimensions and subscripts are reversed 



B. 1 is subtracted from the indices, 
j and i are pointers to integers. 

Figure 3-1. Array Subscripts 

The FORTRAN caller prints out the value 99. Note the following: 

A. Because arrays are stored in column-major order in FORTRAN and row- 
major order in C, the dimension and subscript specifications are reversed. 

B. Because the lower-bound default is 1 for FORTRAN and for C, 1 must be 
subtracted from the indices in the C routine. Also, because FORTRAN 
passes parameters by reference, the *j and *p are pointers used in the C 
routine. 
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3.1.4 Accessing Common Blocks of Data 

The following rules apply to accessing common blocks of data: 

FORTRAN common blocks must be declared by common statements; C 
can use any global variable. Note that the common block name in C (samj 
must end with an underscore. 

Data types in FORTRAN and C programs must match unless you desire 
equivalencing. If so, you must adhere to the alignment restrictions for the 
data types described in Chapter 2. 

• If the same common block is of unequal length, the largest of the sizes is 
used to allocate space. 

Unnamed common blocks are given the name _BLNK_ b 

The following gives examples of C and FORTRAN routines that access 
common blocks of data. 

FORTRAN ___ 

subroutine sam() 
common /r/ i, r 
i - 786 
r = 3.2 
return 



struct S {int i; float j;}}r_; 

main () { 

sam () ; 

printf( xx %d %f \n", r__.i, r_. j) ; 

} 

The C routine prints out 786 and 3.2. 



( 



( 



( 
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3.2 FORTRAN/C Wrapper Interface 

This section describes the process of generating wrappers for C routines called 
by FORTRAN. If you want to call existing C routines (which use value- 
parameters rather than reference parameters) from FORTRAN, these wrappers 
convert the parameters during the call. The program mkflc provides an alternate 
interface for C routines called by FORTRAN. 

FORTRAN routines called by C must use the method described in section 3.1. 

3-2.1 The Wrapper Generator mkf2c 

The mkflc program uses C data type declarations for parameters to generate the 
correct assembly language interface. In generating a FORTRAN-callable entry 
point for an existing C-callable function, the C function is passed through mkflc, 
and mkflc adds additional entry points. Native language entry points are not 
altered. 

These rules are used with mkflc: 

1. Each function given to mkflc must have the standard C function syntax. 

2. The function body must exist but may be empty. Function names are 
transformed as necessary in the output. 

3. By default, mkflc restricts FORTRAN entry points to six characters. This 
default may be overridden. (See mkflcil)). 

This is the simplest case of using a function as input to mkflc: 

func () 
{} 

Here, the function func has no parameters. If mkflc is used to produce a 
FORTRAN-to-C wrapper, the FORTRAN entry is/wnc__. func_ simply calls the 
C routine func(). 
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Here is another example: 

simplefunc (a) 

int a; 

{} 

In this example, the function simplefunc has one argument, a. The argument is 
of type int. For this function, mkf2c produces three items: a FORTRAN entry, 
simple, and two pieces of code. The first piece of code de-references the address 
of a, which was passed by FORTRAN. The second passes the resulting int to C. 
It then calls the C routine simplefunc(). 

Note: Underscores are valid characters in C function names. This is not true 
in FORTRAN. By default, the wrapper generator mkf2c removes 
underscores and other illegal character from the FORTRAN entry 
points they generate, and prints a warning message. This default may 
be overridden (see rnkf2c (1)). 

3.2-2 Using FORTRAN Character Variables as 
Parameters 



You may specify the length of a character variable passed as a parameter to 
FORTRAN either at compilation or at run time. The length is determined by the 
declaration of the parameter in the FORTRAN routine. If the declaration 
contains a length, the passed length must match the declaration. For example, in 
the following declaration, the length of the string is declared to be 10 characters: 

character*10 string 

The passed length must be 10 in order to match the declaration. 

When this next declaration is used, the passed length is taken for operations 
performed on the variable inside the routine: 

character* (*) string 

The length can be retrieved by use of the FORTRAN intrinsic function len. 
Substring operations may cause FORTRAN run-time errors if they do not check 
this passed length. 



( 



( 



c 
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Arrays of character variables are treated by FORTRAN as simple byte-arrays, 
with no alignment of elements. The length of the individual elements is 
determined by the length passed at run time. For instance, the array sarrayO 
may be declared in this manner: 

character* (*) sarray () 

This length is necessary to compute the indices of the array elements. The 
program mkf2c has special constructs for dealing with the lengths of FORTRAN 
character variables. 

3-2.3 Reduction of Parameters 

The program mkflc reduces each parameter to one of seven simple objects. The 
following list explains each object. 

64-bit value 

The quantity is loaded indirectly from the passed address and the result is 
passed to C. Parameters with the C type double (or long float) are reduced 
to 64-bit values by converting the 32-bit FORTRAN real parameter to 
double precision (see below). 

32 -bit value 

mkf2c uses the passed address to retrieve a 32-bit data value. This data 
value is passed to C. Parameters with C types int and long are reduced to 
32-bit values. Any parameter whose type is unspecified is assumed to be 
int. If the ~f option is specified, parameters with the C type float are also 
reduced to 32-bit values. 

16-bit value 

A 16-bit value is loaded using the passed address. The value is either 
extended or masked (depending on whether its type in the function 
parameter list is specified as signed or unsigned) and passed to C. Any 
parameter whose C type is short is reduced to a 16-bit value. 
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8-hit value 

The char type in C corresponds to the character* 1 type in FORTRAN 77. 
(There is no mechanism to pass integer* l variables to C A pointer to the 
value can be passed by declaring the parameter as m**). By default the 
character value is loaded as an unsigned quantity and passed to C. If the 
-signed option has been specified when invoking m£/2c, the character value 
is sign-extended before being passed to C. 

character string 

A copy is made of the FORTRAN character variable, it is null-terminated, 
and it is passed as a character pointer to C. Any modifications that C makes 
to the string will not affect the corresponding character variable in the 
FORTRAN routine. 

character array 

When using mkf2c to call C from FORTRAN, the address of the FORTRAN 
character variable is passed. This character array can be modified by C. It 
is not guaranteed to be null-terminated. The length of the FORTRAN 
character variable is treated differently (as discussed in the next section). 

pointer 

The value found on the stack is treated as a pointer, and is passed without 
alteration. Any array or pointer that is not of type char, any parameter with 
multiple levels of indirection, or any indirect array is assumed to be of type 
pointer. If the type of a parameter is specified, but is not one of the standard 
C types, rnkf2c will pass it as a pointer. 



( 



( 



( 
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Below is an example of a C specification for a function: 

TEST (I,S,C, PTR1 , AR1 , U, F , D , Dl , STR1 , STR2 , STR3 ) 

SHORT S; 

UNSIGNED CHAR C; 

INT *PTR1; 

CHAR *PTR2[]; 

SHORT AR1 [ ] ; 

S0METYPE U; 

FLOAT F; 

LONG FLOAT D, *D1; 

CHAR *STR1; 

CHAR STR2[] , STR3[30]; 

{ 

/* THE C FUNCTION BODY MAY GO HERE. NOTHING 

EXCEPT THE OPENING AND CLOSING BRACES ARE 

NECESSARY */ 
} 

If this function were passed to mkflc, the parameters would be transformed as 
follows: 

• ptrl , ptr2 , arl , dl ,and u would be passed as simple pointer. 

mkflc would complain about not understanding the type sometype, but, by 
default, would assume it to be of type pointer. 

s, c, and d would be passed as values of length 16 bits, 64 bits, and 8 bits, 
respectively, /would be converted to a 64-bit double prior to being passed, 
unless the -f option had been specified. If the -f option had been specified,/ 
would be passed as a 32-bit value. Since the type of i is not specified, it 
would be assumed to be int and would also be passed as a 32-bit value. 
Storing values in any of these parameters would not have any affect on the 
original FORTRAN data. 
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3.Z4 FORTRAN Character Array Lengths 

When the wrapper generator is used, a character variable that is specified as 

char* in the C parameter list is copied and null-terminated. C may thus f" 

determine the length of the string by the use of the standard C function strlen. \ 

If a character variable is specified as a character array in the C parameter list, the 
address of the character variable is passed, making it impossible for C to 
determine its length, as it is not null-terminated. When the call occurs, the 
wrapper code receives this length from FORTRAN. For those C functions 
needing this information, the wrapper passes it by extending the C parameter 
list. 

For example, if the C function header is specified as follows: 

fund (carrl, i, str, j, carr2) 
char carrl [], *str, carr2 [] ; 
int i, j; 
{} 

mkf2c will pass a total of seven parameters to G. The sixth parameter will be the 

length of the FORTRAN character variable corresponding to carrl , and the ^ 

seventh will be the length of carrl . The C f unction fund () must use the varargs i 

macros to retrieve these hidden parameters, mkflc will ignore the varargs ^ 

macro va_alist appearing at the end of the parameter name list, and its 

counterpart va_dci appearing at the end of the parameter type list. In the case 

above, use of these macros would produce the function header: 

♦include "varargs.h" 

fund (carrl, i, str, j, carr2, va__al± st ) 

char carrl [], *str, carr2[]; 

int i , j ; 

va_dci 

{} 

The C routine could retrieve the lengths of carrl and carrl, placing them in the 
local variables carrl Jen and carrl Jen by the following code fragment: 

va__list ap; 

int carrl_len, carr2_len; 

va_start (ap) ; jf" 

carrl__len « va_arg (ap, int) I 

carr2_len ~ va_arg (ap, int) 
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3.2.5 Using mkf2c and extcentry 

mkflc understands only a limited subset of the C grammar. This subset includes 
common C syntax for function entry point, C-style comments, and function 
bodies. However, it cannot understand constructs such as typedefs, external 
function declarations, or C preprocessor directives. 

Therefore, to insure that only those constructs understood by mkflc are included 
in wrapper input, you need to place special comments around each function for 
which FORTRAN-to-C wrappers are to be generated (see example on next 
page). 

Once these special comments, /* gentry */ and /* endcentry */, are placed 
around the code, use the program excentry(l) prior to mkflc to generate the input 
file for m£/2c. 

To illustrate the use of extcentry, the C file/<90.c is shown below. It contains the 
function/00, which is to be made FORTRAN-callable. 

typedef unsigned short grunt [4]; 
struct { 

long 1,11; 

char *str; 
} bar; 
main () 
{ 

int kappa =7; 

foo (kappa, bar. str) ; 
} 

/* GENTRY */ 
foo (integer, cstring) 
int integer; 
char *cstring; 
{ 

if (integer==l) printf ("%s", cstring) ; 
} /* ENDCENTRY */ 

The special comments /* centry */ and /* endcentry */ surround the 
section that is to be made FORTRAN-callable. 
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To generate the assembly language wrapper foowrp.s from the above file foo.c t 
use the following set of commands: 

extcentry foo.c foowrp.fc 
mkf2c foowrp.fc foowrp.s 

The programs mkflc and extcentry are found in the directory lusrlbin on your 
workstation. 

3.2.6 Makefile Considerations 

make{\) contains default rules to help automate the control of wrapper 
generation. The following example of a makefile illustrates the use of these 
rules. In the example, an executable object file is created from the files main.f 
(a FORTRAN main program) and called 

test: main.o calico 

f77 -o test main.o calico 
calico: callcfc 
clean: 

rm -f *.o test *.fc 

In this program, main calls a C routine in callc.c The extension .fc has been 
adopted for FORTRAN-to-call-C wrapper source files. The wrappers created 
from callcfc will be assembled and combined with the binary created from 
callc.c. Also, the dependency of calico on callcfc will cause callcfc to be 
recreated from callc.c whenever the C source file changes. (The programmer is 
responsible for placing the special comments for extcentry in the C source as 
required.) 

Note: Options to mflc may be specified when make is invoked by setting the 
make variable F@CFLAGS. Also, you should not create a fc file for 
the modules that need wrappers created. These files are both created 
and removed by make{\) in response to the file, o :file fc dependency. 



( 



( 



( 
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The makefile above will control the generation of wrappers and FORTRAN 
objects. Additional modules may be added to the executable object file in one of 
the following ways: 

If the file is a native C file whose routines are not to be called from 
FORTRAN using a wrapper interface, or if it is a native FORTRAN file, 
add the .o specification of the final make target and dependencies. 

If the file is a C file containing routines to be called from FORTRAN using 
a wrapper interface, the comments for extcentry must be placed in the C 
source, and the .o file placed in the target list. In addition, the dependency 
of the .0 file on the fc file must be placed in the makefile. This dependency 
is illustrated in the example makefile above where calif. o depends on 
callffc. 



3.3 FORTRAN/Pascal Interface 



This section discusses items you should consider before writing a call between 
FORTRAN and Pascal. 

3.3.1 Procedure and Function Declarations 

Names 

In calling a FORTRAN program from Pascal, you must place an underscore (J 
as a suffix to routine names and data name. 

To call FORTRAN from Pascal or vice versa, specify an underscore (_) as the 
suffix of the name of the FORTRAN or Pascal routine being called. For 
example, if the routine is called matrix, then call it by the name matrix _. 

In Pascal, always declare the external FORTRAN subprogram or function with 
VAR parameters. 



Version 2.0 FORTRAN Program Interfaces 3-17 



Note that only one main routine is allowed per program. The main routine can 
be written either in Pascal or FORTRAN. Below is an example of a Pascal and 
a FORTRAN main routine. 



Pascal 



FORTRAN 



program p; 
begin 

writelnrhi!"); 
end. 



write (6,10) 
format (*hi!') 
stop 
end 



Table 3-4„ Main Routines 



Invocation 



If you have alternate return labels, you can invoke a FORTRAN subprogram as 
if it were an integer- valued function whose value specifies which alternate return 
to use. Alternate return arguments (statement labels) are not passed to the 
function but cause an indexed branch in the calling subprogram. If the 
subprogram is not a function and has no entry points with alternate return 
arguments, the returned value is undefined. The FORTRAN statement 

call nret (*1,*2,*3) 

is treated exactly as if it were the computed goto 

goto (1,2,3), nret() 

A Pascal function that calls a FORTRAN subroutine can usually ignore the 
return value. The table below shows equivalent function declarations in Pascal 
and FORTRAN. 



( 



( 



Pascal 



FORTRAN 



function dfort___() : double; double precision function dfort() 
function rfort__() : real; real function rfort() 
function ifort_() ; integer; integer function ifort () 

Table 3-5. Function Declarations 

FORTRAN has a built-in data type complex that does not exist in Pascal. 
Therefore, there is no compatible way of returning these values from Pascal. 



( 
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A character- valued FORTRAN function is equivalent to a Pascal language 
routine with two initial extra arguments: a data address, and a length. 

The following FORTRAN statement: 

character*15 function g (...) 

is equivalent to the Pascal code: 

type string = array [1..15]; 
var 

length: integer; 

a : array[1..15] of char; 
procedure g_(var a: string; length: integer;...) ; external; 

and could be invoked by the Pascal line: 

g_ (a, 15); 
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3.3.2 Arguments 



The following rules apply to argument specifications in both FORTRAN and 
Pascal programs: 

1 . All arguments must be passed by reference. That is, the argument must 
specify an address rather than a value. Thus, to pass constants or 
expressions, their values must first be stored into variables, and then the 
addresses of the variables passed. 



( 



2. When passing the address of a variable, the data representations of the 
variable in the calling and called routines must correspond, as shown in 
Table 3-2. 



Pascal 


FORTRAN 


integer 


integer*4, integer, logical 


cardinal, char, boolean, 
enumeration 


character 


real 


real 


double 


double precision 


procedure.. 


subroutine 


record 
r:real; 
hreal; 

end; 


complex 


record 

ndouble; 

hdouble; 
end; 


double complex 



( 



Table 3-6. Equivalent FORTRAN and Pascal Data Types 



Note that FORTRAN requires that each integer, logical, and real 
variable occupy 32 bits of memory. 

Functions of type integer, real, or double precision are 
interchangeable between FORTRAN and Pascal, and require no special 
considerations. 



( 
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3. The FORTRAN compiler may add items not explicitly specified in the 
source code to the argument list The compiler adds the following items 
under the conditions specified: 

Destination address for character functions, when called. 

Length of character strings, when an argument is the address of a 
character string. 

When a Pascal program calls a FORTRAN subprogram, the Pascal program 
must explicitly specify these items in its argument list in the following order: 

1 . Destination address of character function. 

2. Normal arguments (addresses of arguments or functions.) 

3. Length of character strings. The length must be specified as an absolute 
value or integer variable. The next two examples illustrate these rules. 

Example: The following example shows how a Pascal routine must specify the 
length of a character string (which is only implied in a FORTRAN call). 

FORTRAN call to SAM 



C SAM IS A ROUTINE WRITTEN IN FORTRAN 

EXTERNAL F 
CHARACTER* 7 S 
INTEGER B(3) 

call sam (F, B(l), s) <-- LENGTH OF S IS IMPLICIT. 
Pascal call to SAM 

PROCEDURE F__; EXTERNAL; 

S: ARRAY[1..7] OF CHAR; 
B: ARRAY [1.. 3] OF INTEGER; 

SAM (F, B[lj, s, 7); <- LENGTH OF S IS EXPLICIT. 
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3.3.3 Execution-Time Considerations 

Pascal checks certain variables for errors at execution time, whereas FORTRAN 
doesn't. For example, in a Pascal program, when a reference to an array exceeds £ 

its bounds, the error is flagged (if runtime checks aren't suppressed). Use the \ 

f77 -c flag if you want a FORTRAN program to detect similar errors when you 
pass data to it from a Pascal program. 

3.3.4 Array Handling 

FORTRAN stores arrays in column-major order, where the leftmost subscripts 
vary the fastest. Pascal, however, stores arrays in row-major order, with the 
rightmost subscript varying the fastest. Also, the default lower bound for arrays 
in FORTRAN is 1. Pascal has no default; the lower bound must be explicidy 
specified. Here is an example of the different layouts: 

FORTRAN 

integer t (2,3) 

t(l,l), t(2,l), t(l,2), t(2,2), t(l,3), t(2,3) 

Pascal ___ 

var t: array [1 . .2, 1 . . 3] of integer; 

t[l,l], t[l,2], t[l,3], t[2,lj, t[2,2j, t[2 f 3] 

When a Pascal routine uses an array passed by a FORTRAN program, the 
dimensions of the array and the use of the subscripts must be interchanged. The 
example below shows the Pascal code that interchanges the subscripts. 

In the following example, the FORTRAN routine calls the Pascal procedure p, 
receives the value 99, and prints it out 



( 



( 
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FORTRAN 

INTEGER A (2, 3) 
CALL P (A, 1, 3) 
WRITE (6,10) A(l,3) 
10 FORMAT (IX, 19) 

STOP 
END 

Pascal 



TYPE ARRY = ARRAY [1..3,1..2]; 

PROCEDURE P (VAR A: ARRY; VAR I, J: INTEGER) ; 



BEGIN 

A[I,J] := 99; 
END; 



In the next example, the Pascal routine passes the character string "0123456789" 
to the FORTRAN subroutine s_, which prints it out and then returns to the 
calling program. 



Pascal 



TYPE STRING - ARRAY [1.. 10] OF CHAR; 

PROCEDURE S_( VAR A: STRING; I: INTEGER); EXTERNAL; 

/* NOTE THE UNDERBAR */ 

PROGRAM S; 

VAR 

R: STRING; 
BEGIN 

R:« "0123456789"; 

S_(R,10); 
END. 



FORTRAN 



SUBROUTINE S (C) 
CHARACTER* 10 C 
WRITE (6,10) C 
10 FORMAT (6,10) C 
RETURN 
END 
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3.3.5 Accessing Common Blocks of Data 

The following rules apply to accessing common blocks of data: 

• FORTRAN common blocks must be declared by common statements; Pascal 
can use any global variable. Note that the common block name in Pascal 
(sarn_) must end with an underscore. 

• Data types in the FORTRAN and Pascal programs must match unless you 
desire implicit equivalencing. If so, you must adhere to the alignment 
restrictions for the data types described in Chapter 2. 

If the same common block is of unequal length, the largest of the sizes is 
used to allocate space. 

Unnamed common blocks are given the name _BLNK__, w here _ is the 
underscore character. 

Example: 

The following gives examples of FORTRAN and Pascal routines that access 
common blocks of data. 



c 



( 



( 
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Pascal 

VAR 

A: RECORD 

I : INTEGER; 
R : REAL; 
END; 
PROCEDURE SAM_; 

EXTERNAL; 
PROGRAM S; 

BEGIN 
A_.I :- 4; 
A_.R :- 5.3; 
SAM_; 
END. 

FORTRAN 



SUBROUTINE SAM() 
COMMON /A/I,R 
WRITE (6,10) I,R 
10 FORMAT (1X,I5,F5.2) 

RETURN 
END 



The FORTRAN routine prints out 4 and 5.30. 
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c 



( 



( 



4. System Functions and 
Subroutines 



This chapter describes extensions to FORTRAN 77 that are related to the IRIX 
operating and compiler system. 



4.1 Library Functions 

The following tables summarize the functions that are available in the 
FORTRAN run-time library. These function provide an interface from 
FORTRAN programs to the system in the same manner as the C library does for 
C programs. The compiler automatically loads an interface routine when it 
processes the associated call. 



Function 


Purpose 


abort 


abnormal termination 


access 


determine accessibility of a file 


acct 


enable/disable process accounting 


alarm 


execute a subroutine after a specified time 


barrier 


perform barrier operations 


blockproc 


block processes 


brk 


change data segment space allocation 


chdir 


change default directory 


chmod 


change mode of a file 


chown 


change owner 



Table 4-1. Summary of System Interface Routine Library 
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Function 


Purpose 


chroot 


change root directory for a command 


close 


close a file descriptor 


creat 


create or rewrite a file 


ctime 


return system time 


dtime 


return elapsed execution time 


dup 


duplicate an open file descriptor 


etime 


return elapsed execution time 


exit 


terminate process with status 


fcntl 


file control 


fdate 


return date and time in an ASCII string 


fgetc 


get a character from a logical unit 


fork 


create a copy of this process 


fputc 


write a character to a FORTRAN logical unit 


freej)arrier 


frees barrier 


fseek 


reposition a file on a logical unit 


fstat 


get file status 


ftell 


reposition a file on a logical unit 


gerror 


get system error messages 


getarg 


return command line arguments 


getc 


get a character from a logical unit 


getcwd 


get pathname of current working directory 


getdents 


read directory entries 


getegid 


get effective group ID 


getenv 


get value of environment variables 


geteuid 


get effective user ID 


getgid 


get user or group ID of the caller 


gethostname 


get current host ID 


getlog 


get user's login name 


getpgrp 


get process group ID 


getpid 


get process ID 



( 



( 



( 



Table 4-1 (continued). Summary of System Interface Routine Library 
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Function 


Purpose 


getppid 


get parent process ID 


getsockopt 


get options on sockets 


getuid 


get user or group ID of caller 


gmtime 


return system time 


iargc 


return command line arguments 


idate 


return date or time in numerical form 


ierrno 


get system error messages 


ioctl 


control device 


isatty 


determine if unit is associated with tty 


itime 


return date or time in numerical form 


loll 


send a signal to a process 


link 


make a link to an existing file 


loc 


return the address of an object 


lseek 


move read/write file pointer 


lstat 


get file status 


ltime 


return system time 


m_fork 


creates parallel processes 


m_get_myid 


get task ID 


m_getjiumprocs 


get number of subtasks 


m_kill_procs 


kill process 


mjock 


set global lock 


m_next 


return value of counter 


m_set_j)rocs 


set number of subtasks 


m_unlock 


unset a global lock 


mkdir 


make a directory 


mknod 


make a directory/file 


mount 


mount a filesystem 


new_barrier 


initializes a barrier structure 


nice 


lower priority of a process 


open 


open a file 



Table 4-1 (continued). Summary of System Interface Routine Library. 
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Function 


Purpose 


oserror 


gel/set system error 


pause 


suspend process until signal 


perror 


get system error messages 


pipe 


create an inter-process channel 


plock 


lock process, test, or data in memory 


prctl 


control processes 


profil 


execution time profile 


ptrace 


process trace 


putc 


write a characer to a FORTRAN logical unit 


putenv 


set environment variable 


qsort 


quick sort 


read 


read from a file descriptor 


readlink 


read value of symbolic link 


rename 


change the name of a file 


rmdir 


remove a directory 


sbrk 


change data segment space allocation 


send 


send a message to a socket 


setblockproccnt 


set semaphore count 


setgid 


set group ID 


sethostid 


set current hpst ID 


setoserror 


set system error 


setpgrp 


set process group ID 


setsockopt 


set options on sockets 


setuid 


set user ID 


shmdt 


shared memory 


sginap 


timed sleep function 


sighold 


raise priority and hold signal 


sigignore 


ignore signal 


signal 


change the action for a signal 


sigpause 


suspend until receive signal 



( 



( 



( 
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Function 


Purpose 


sigrelse 


release signal and lower priority 


sleep 


suspend execution for an interval 


socket 


create an endpoint for communication-TCP 


sproc 


create a new share group process 


stat 


get file status 


stime 


set time 


symlink 


make symbolic link 


sumlnk 


make symbolic link 


sync 


update superblock 


system 


issue a shell command 


taskblock 


block tasks 


taskcreate 


create a new task 


tasked 


control task 


taskdestroy 


kill task 


tasksetblockcnt 


set task semaphore count 


taskunblock 


unblock task 


time 1 


return system time 


ttynam 


find name of a terminal port 


uadmin 


administrative control 


ulimit 


get and set user limits 


umask 


get and set fie creation mask 


umount 


dismount a file system 


unblockproc 


unblock processes 


unlink 


remove a directory entry 


usconfig 


semaphore and lock configuration operations 


uscpsema 


acquires a semaphore 


uscsetlock 


unconditionally sets lock 


1 This library function time can only be invoked if it is declared in an external 
statement. Otherwise, it will be misinterpreted as the VMS compatible intrinsic 
subroutine time. 
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Function 


Purpose 


usctlsema 


semaphore control operations 


usfree 


user shared memory allocation 


usfreelock 


free a lock 


usfreesema 


free a semaphore 


usinit 


semaphore and lock initialize routine 


usinitlock 


initialize a lock 


usnewsema 


allocate and initialize a semaphore 


uspsema 


acquires a semaphore 


usrealloc 


user share memory allocation 


ussetlock 


set lock 


ustest lock 


test lock 


ustestsema 


return value of semaphore 


usunsetlock 


unset lock 


usvsema 


free a resource to a semaphore 


uswsetlock 


set lock 


wait 


wait for a process to terminate 


write 


write to a file 



( 



( 



Table 4-1 (continued). Summary of System Interface Routine Library 

You can display reference information on the functions shown in the figure 
using the man command in the following format: 

man 3f function 



( 
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4.2 Intrinsic Subroutine Extensions 

This section describes the intrinsic subroutines that are extensions to FORTRAN 
77. The following summarizes applicable rules for the subroutines. 

1. The subroutine names are specially recognized by the compiler. A user- 
written subroutine with the same name as a system subroutine must be 
declared in an external statement in the calling subprogram. 

2. Using a user- written subroutine with the same name as a system subroutine 
in one subprogram does not preclude using the actual system subroutine in a 
different subprogram. 

3. To pass the name of a system subroutine as an argument to another 
subprogram, the name of the system subroutine must be declared in an 
intrinsic statement in the calling subprogram. 

4. When a system subroutine name is passed as an argument to another 
subprogram, the call to the system subroutine via the formal parameter 
name in the receiving subprogram must use the primary calling sequence for 
the subprogram (when there is more than one possible calling sequence). 

The following table gives an overview of the system subroutines and their 
function; they are described in detail in the sections following the table. 



Subroutine 



Information Returned 



date Current date as nine-byte string in ASCII representation. 

i date Current month, day, and year, each represented by a 

separate integer. 

errsns Description of the most recent error. 

exit Terminates program execution. 

time Current time in hours, minutes, and seconds as an eight- 

byte string in ASCII representation. 

mvbits Moves a bit field to a different storage location. 

Table 4-2. Overview of System Subroutines 
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4.2.1 DATE 

Returns the current date as set by the system; the format is as follows: 

CALL DATE (huf) 

where buf is a variable, array, array element, or character substring nine bytes 
long. After the call, buf contains an ASCII variable in the format dd-mmm-yy, 
where dd is the date in digits, mmm is the month in alphabetic characters, and yy 
is the year in digits. 

4.2.2 IDATE 



c 



Returns the current date as three integer values representing the month, date, and 
year; the format is: 

CALL IDATE (m, d, y) 

where m, d, and y are either integer* 4 or integer* 2 values representing the 
current month, day and year. For example, the values of m, d and y on August 
10th, 1989, are: 

m = 8 
d - 10 
y - 89 



( 



( 
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4.2.3. ERRSNS 

Returns information about the most recent program error; the format is: 

CALL ERRSNS (argl, arg2, arg3, arg4, arg5) 

The arguments (argl, arg2, etc.) can be either integer* 4 or integer*2 
variables; upon return from errsns, they contain the information shown in the 
following table: 



Argument 



Contents 



argl 

arg2 
arg3 
arg4 

ar%5 



IRIX global variable errno, which is then reset to zero 
after the call 

Zero 

Zero 

Logical unit number of the file which was being 
processed when the error occurred 

Zero 



Table 4-3. Information Returned by ERRSNS 

Although only argl and agr4 return relevant information, arg2, arg3, and arg5 
are always required. 

4.2.4 EXIT 



Causes normal program termination, and optionally returns an exit-status code. 

CALL EXIT (status) 

where status is an integer* 4 or integer* 2 argument containing a status code. 
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4.Z5 TIME 

Returns the current time in hours, minutes, and seconds; the format is: 

CALL TIME (clock) 

clock can be a variable, array, array element, or character substring; it must be 
eight bytes in length. After execution, clock contains the time in the format 
hh:mm:ss 9 where hh, mm and ss are numerical values representing the hour, the 
minute, and the second. 

4.Z6 MVBITS 

Transfers a bit field from on storage location to another; the format is: 

CALL MVBITS (source, sbit, length, destination f dhit) 
The arguments are defined in the following table: 



( 



Argument 1 


Type 


Contents 


source 


integer variable or 
array element 


Source location of bit field to be 
transferred 


shit 


integer expression 


First bit position in the field to be 
transferred from source. 


length 


integer expression 


Length of the field to be transferred 
from source. 


destination 
dhit 


integer variable or 
array element 
integer expression 


Destination location of the bit field 

First bit in destination to which the 
field is transferred 


1 The arguments < 


can be declared as INTEGER*2 or INTEGER*4. 



Table 4-4. Arguments to MVBITS 



( 



( 
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4.3 Intrinsic Function Extensions 



The following table gives an overview of the intrinsic functions added as 
extensions of FORTRAN 77. 



Function Information Returned 

SECNDS Elapsed time as a floating point value in seconds. 

RAN The next number from a sequence of pseudo random numbers. 

Tabie 4-5. Intrinsic Function Extensions 

These functions are described in detail in the following sections. 
4.3.1 SECNDS 



Returns the number of seconds since midnight, minus the value of the passed 
arguments. The format is: 

s - SECNDS (n) 

After execution, s contains the number of seconds past midnight less the value 
specified by n. Both s and n are single-precision, floating point values. 

4.3.2 RAN 



Generates a random number. The format is: 

v - RAN(s) 

The argument s is an integer* 4 variable or array element; s serves as a seed in 
determining the next random number and should initially be set to a large, odd 
integer value. This permits the computation of multiple random number series 
by supply different variable names as the seed argument to ran. 
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5. FORTRAN Enhancements for 
Multiprocessors 



5.1 Overview 



The Silicon Graphics FORTRAN compiler allows you to apply the capabilities 
of a Silicon Graphics multiprocessor workstation to the execution of a single 
job. By coding a few simple directives, the compiler splits the job into 
concurrently executing pieces, thereby decreasing the run time of the job. 

This chapter discusses techniques for analyzing your program and converting it 
to multiprocessing operations. Chapter 6 gives compilation and debugging 
instructions for parallel processing. 



5.2 Parallel Loops 

The model of parallelism used focuses on the FORTRAN do loop. The compiler 
takes a do loop and executes different iterations of the loop in parallel on 
multiple processors. For example, using the simple scheduling method on a do 
loop that consists of 200 iterations and it is run on a machine with four 
processors, the first 50 iterations run on one processor, the next 50 on another, 
and so on. The multiprocessing code adjusts itself at run time to the number of 
processors actually present on the machine. Thus, if the above 200 iteration 
loop was moved to a machine with only two processors, it would be divided into 
two blocks of 100 iterations each, without any need to recompile or relink. In 
fact, multiprocessing code can even be run on single-processor machines. The 
above loop would be divided into one block of 200 iterations. This allows code 
to be developed on a single-processor Silicon Graphics IRIS-4D Series 
workstation or Personal IRIS, and later run on an IRIS POWER Series 
multiprocessor. 
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The processes that participate in the parallel execution of a task are arranged in a 
master/slave organization. The original process is the master. It creates zero or 
more slaves to assist. When a parallel do loop is encountered, the master asks 
the slaves for help. When the loop is complete, the slaves wait on the master, 
and the master resumes normal execution. The master process and each of the 
slave processes is called a thread of execution or simply a thread. By default, the 
number of threads is set equal to the number of processors on the particular 
machine. If you want, you can override the default and explicitly control the 
number of threads of execution used by a FORTRAN job. 

For multiprocessing to work correctly, the iterations of the loop must not depend 
on one another; each iteration must stand alone and produce the same answer 
regardless of whether any other iteration of the loop is executed. Not all do 
loops have this property, and loops without it cannot be correctly executed in 
parallel. However, many of the loops encountered in practice fit this model and 
are available for multiprocessing. Further, many loops that cannot be run in 
parallel in their original form can be rewritten to run wholly or partially in 
parallel. 

To provide compatibility for existing parallel programs, Silicon Graphics has 
chosen to adopt the syntax for parallelism used by Sequent Computer 
Corporation. This syntax takes the form of compiler directives embedded in 
comments. These fairly high level directives provide a convenient method for 
you to describe a parallel loop, while leaving the details to the FORTRAN 
compiler. For advanced users, there are a number of special routines that permit 
more direct control over the parallel execution. (Refer to Section 5.8, 
"Advanced Features" for more information.) 



5.3 Writing Parallel FORTRAN 

The FORTRAN compiler accepts directives that cause it to generate code that 
can be run in parallel. The compiler directives look like FORTRAN comments: 
they begin with a c in column one. If multiprocessing is not turned on, these 
statements are treated as comments. This allows the identical source to be 
compiled with a single-processing compiler, or by FORTRAN without the 
multiprocessing option. The directives are distinguished by having a $ as the 
second character. There are six directives that are supported: c$doacross , 

C$&, C$,C$MPJSCHEDTYPE, C$CHUNK,andC$COPYIN. TheC$COPYIN 

directive is described in section 5.8.10. The others are described below. 



c 



( 



( 
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5.3.1 C$DOACROSS 

The essential compiler directive is c$doacross. This directs the compiler to 
generate special code to run iterations of the do loop in parallel. The 
c$doacross statement applies only to the very next statement (which must be a 
do loop). 

The c$doacross directive has the form 

C$DOACROSS [ clause [ , clause ] ... j 

where a clause is one of 

SHARE (variable list) 
LOCAL (variable list) 
LASTLOCAL (variable list) 
REDUCTION (scalar variable list) 
IF (logical expression) 
CHUNK=integer expression 
MP__SCHEDTYPE=schedule type 

The meaning of each clause is discussed below. All of these clauses are 
optional. 

SHARE, LOCAL, LASTLOCAL 

These are lists of variables as discussed in the section 5.4, "Analyzing Data 
Dependencies for Multiprocessing". A variable may appear in only one of these 
lists. To make the task of writing these lists easier, there are several defaults. 
The loop iteration variable is lastlocal by default. All other variables are 
share by default. 

local is a little faster than lastlocal, so if you do not need the final value, it is 
good practice to put the do loop index variable into the local list, although this 
is not required. 

Only variables may appear in these lists. In particular, common blocks cannot 
appear in a local list (but see the discussion of local common blocks in the 
"Advanced Features" section). The share, local, and lastlocal lists give 
only the names of the variables. If any member of the list is an array, it is listed 
without any subscripts. 
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( 



Note; There is a small flaw in the way that unlisted variables are defaulted to 
SHARE. There must be at least one reference to the variable in a non-parallel 
region, or at least one appearance of that variable in the share list of some loop. 
If not, the compiler will complain that the variable in the multiprocessed loop 
has not been previously referenced. 

REDUCTION 

The reduction clause lists those variables involved in a reduction operation. 
The meaning and use of reductions is discussed in example 4 of the section 
"Breaking Data Dependencies." An element of the reduction list must be an 
individual variable (also called a scalar variable) and may not be an array. 
However, it may be an individual element of an array. In this case, it would 
appear in the list with the proper subscripts. 

It is possible for one element of an array to be used in a reduction operation, 
while other elements of the array are used in other ways. To allow for this, if an 
element of an array appears in the reduction list, it is legal for that array also to 
appear in the share list. 

The compiler makes some simple checks to confirm that the reduction 

expression is legal. The compiler does not, however, check all statements in the ^ 

do loop for illegal reductions. It is up to the programmer to assure legal use of I 

the reduction variable. ^ 

IF 

The IF clause gives a logical expression that is evaluated just before the loop is 
executed. If the expression is true, the loop is executed in parallel. If the 
expression is false, the loop is executed serially. Typically, the expression test 
the number of times the loop will execute in order to be sure that there is enough 
work in the loop to amortize the overhead of parallel execution. Currently, the 
break-even point is about 400 cpu clocks of work, which normally translates to 
about 100 floating point operations. 



( 
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MP^SCHEDTYPE, CHUNK 

These options affect the way the work in the loop is scheduled among the 
participating tasks. They do not affect the correctness of the loop. They are 
useful for tuning the performance of critical loops. See section 5.7.3, "Load 
Balancing" for more details. 

Four methods of scheduling the iterations are supported. A single program may 
use any or all of them as it finds appropriate. 

The simple method (mp_schedtype=simple) divides the iterations among the 
processes by dividing them into contiguous pieces, and assigning one piece to 
each process. 

The interleave scheduling method (mp_schedtype= interleave) breaks the 
iterations up into pieces of the size specified by the chunk option, and execution 
of those pieces is interleaved among the processes. For example, if there are 
four processes and chunk=2, then the first process will execute iterations 1-2, 
9-10, 17-18, ...; the second process will execute iterations 3-4, 11-12, 19- 
20,. . . ; and so on. Although this is more complex than the simple method, it is 
still a fixed schedule with only a single scheduling decision. 

In dynamic scheduling, (mp_schedtype=dynamic) the iterations are broken into 
chunk sized pieces. As each process finishes a piece, it enters a critical section 
to grab the next available piece. This gives good load balancing at the price of 
higher overhead. 

The fourth method is a variation of the guided self-scheduling algorithm 
(mp_schedtype=:GSS). Here, the piece size is varied depending on the number of 
iterations remaining. By parceling out relatively large pieces to start with, and 
relatively small pieces toward the end, the hope is to achieve good load 
balancing while reducing the number of entries into the critical section. 

In addition to these four methods, the user may specify the scheduling method 
at runtime (mp_schedtype=runtime). Here, the scheduling routine examines 
values in the user's runtime environment and uses that information to select one 
of the four methods. See the section "Advanced Features" for more details. 

If both the mp_schedtype and chunk clauses are omitted, simple scheduling is 
assumed. If mp_schedtype is set to interleave or dynamic and the chunk 
clause is omitted, chunk=i is assumed. If mp_schedtype is set to one of the 
other values, chunk is ignored. If the mp_schedtype clause is omitted, but 
chunk is set, then mp__schedtype=dynamic is assumed. 
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Example 1 

The code fragment: 

DO 10 I - 1, 100 
A(I) - B(I) 
10 CONTINUE 

could be multiprocessed with the directive: 

C$DOACROSS LOCAL (I), SHARE (A, B) 
DO 10 1=1, 100 
A(I) - B(I) 
10 CONTINUE 

Here, the defaults are sufficient, provided A and B are mentioned in a non- 
parallel region, or in another share list. The following then works: 

C$DOACROSS 

DO 10 I - 1, 100 
A(I) = B(I) 
10 CONTINUE 

Example 2 

DO 10 I - 1, N 

X = SQRT(A(I)) 
B(I) - X*C(I) + X*D(I) 
10 CONTINUE 

You can be fully explicit: 

C$D0ACR0SS L0CAL(I, X), SHARE (A f B, C, D, N) 
DO 10 I - 1, N 

X = SQRT(A(I)) 

B(I) - X*C(I) .. + X*D(I) 
10 CONTINUE 

or you can use the defaults: 



c 



( 



( 
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C$DOACROSS LOCAL (X) 

DO 10 I - 1, N 

X - SQRT(A(I)) 

B(I) - X*C(I) + X*D(I) 
10 CONTINUE 

See Example 5 in Section 5.4 for more information on this example. 

Example 3 

DO 10 I - M, K, N 
X - D(I)**2 
Y - X + X 
DO 20 J = I, MAX 

A(I,J) = A(I,J) + B(I,J) * C(I,J) * X + Y 
20 CONTINUE 

10 CONTINUE 

PRINT*, I, X 

Here, the final values of i and x are needed after the loop completes. A 
correct directive is: 

C$DOACROSS LOCAL(Y, J) , LASTLOCAL (I, X) , SHARE (M, K, N, ITOP,A,B, C, D) 
DO 10 I = M, K, N 
X - D(I)**2 
Y - X + X 
DO 20 J = I, ITOP 

A(I,J) = A(I, J) + B(I, J) * C(I,J) *X + Y 
20 CONTINUE 
10 CONTINUE 

PRINT*, I, X 

or you could use the defaults: 
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C$DOACROSS LOCAL (Y, J) , LAST LOCAL (X) 
DO 10 I - M, K, N 
X = D(I)**2 
Y - X + X 
DO 20 J - I, MAX 

A(I,J) - A(I, J) + B(I,J) * C(I,J) *X + Y 
20 CONTINUE 
10 CONTINUE 



( 



PRINT*, I, X 

i is a loop index variable for the c$doacross loop, and so it is lastlocal by 
default. However, even though j is a loop index variable, it is not the loop index 
of the loop being multiprocessed, so the variable has no special status. If it was 
not declared, it would be given the normal default of share, which would be 
wrong. 

5.3.2 C$& 



Occasionally, the clauses in the c$doacross directive are longer than one line. 
The c$& directive is used to continue the directive onto multiple lines. 

C$DOACROSS SHARE (ALPHA, BETA, GAMMA, DELTA, 
C$& EPSILON, OMEGA), LASTLOCAL (I, J, K, L, M, N) , 
C$& LOCAL (XXXI, XXX2, XXX3, XXX4, XXX5, XXX6, XXX7, 
C$& XXX8, XXX9) 



( 



The c$ directive is considered a comment line except when multiprocessing. A 
line beginning with c$ is treated as a conditionally compiled FORTRAN 
statement. The rest of the line contains a standard FORTRAN statement. The 
statement is compiled only if multiprocessing is turned on. In this case, the c 
and $ are treated as if they are blanks. They can be used to insert debugging 
statements, or an experienced user can use it to insert arbitrary code into the 
multiprocessed version. 



C$ PRINT 10 

C$ 10 FORMAT ('BEGIN MULTIPROCESSED LOOP') 

C$DOACROSS LOCAL (I) , SHARE (A, B) 
DO I - 1, 100 

CALL COMPUTE (A, B, I) 
END DO 



( 



5-8 FORTRAN 77 Programmer's Guide 



IRIS-4D Series 



5.3-4 C$MP_SCHEDTYPE, C$CHUNK 

The c$MP__scHEDTYPE=schedule__type directive acts as an implicit 
mp_schedtype clause. A doacross directive that does not have an explicit 
mp_schedtype clause is given the value specified in the directive, rather than the 
normal default If the doacross does have an explicit clause, then the explicit 
value is used. 

The c$CHUNK=integer_expression directive affects the chunk clause of a 
doacross in the same way that the c$mp_schedtype directive affects the 
mp_schedtype clause. Both directives are in effect from the place they occur in 
the source until another corresponding directive is encountered, or the end of the 
procedure is reached. 

These directives are mostly intended for users of PFA. The doacross directives 
supplied by PFA do not have mp_schedtype or chunk clauses. These directives 
provide a method of specifying what land of scheduling option is desired and 
allowing PFA to supply the doacross directive. These directives are not PFA 
specific however, and may be used by any multiprocessing FORTRAN 
programmer. 

It is also possible to invoke this functionality from the command line during a 
compile. The -mp_schedtyle=schedule_type and -chunk= integer 
command line options have the effect of implicitly putting the corresponding 
directive(s) as the first lines in the file. 

5.3.5 Nesting of C$DOACROSS 

The FORTRAN compiler does not support direct nesting of c$ doacross loops. 
For example, the following is illegal and generates a compilation error: 

C$DOACROSS LOCAL (I) 

DO I = 1, N 
C$DOACROSS LOCAL (J) 

DO J - 1, N 

A (I, J) = B(I,J) 
END DO 
END DO 
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However, to simplify separate compilation, a different form of nesting is 
allowed. A routine that uses c$doacross may be called from within a 
multiprocessed region. This can be useful if a single routine is called from 
several different places: sometimes from within a multiprocessed region, 
sometimes not. Nesting does not increase the parallelism. When the first 
c$doacross loop is encountered, that loop is run in parallel. Ifwhileinthe 
parallel loop a call is made to a routine that itself has a c$doacross, this 
subsequent loop is executed serially. 

5.3.6 Parallel Blocks 

The Silicon Graphics FORTRAN compiler supports parallel execution of do 
loops only. However, another kind of parallelism frequently occurs: different 
blocks of code independent of each other can be executed simultaneously. As a 
simple example: 

CALL MAKE1(A, B, C, D) 
CALL MAKE2 (E, F, G, H) 

If you know that these two routines do not interfere with each other, you can call 

them simultaneously. The following example shows how to use do loops to 

execute parallel blocks of code. ^ 

( 

C$DOACROSS LOCAL (I), MPJSCHEDTYPE-SIMPLE v 

DO I - 1, 2 

IF (I .EQ. 1) THEN 

CALL MAKE1(A, B, C, D) 
ELSEIF (I .EQ. 2) THEN 

CALL MAKE2(E, F, G, H) 
END IF 
END DO 



( 
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5.4 Analyzing Data Dependencies for 
Multiprocessing 

The essential condition required to correctly parallelize a loop is that each 
iteration of the loop must be independent of all other iterations. If a loop meets 
this condition, then the order in which the iterations of the loop execute is not 
important. They can be executed backwards, or even at the same time, and the 
answer is still the same. This property is captured by the notion of data 
independence. For a loop to be data independent, no iterations of the loop can 
write a value into a memory location that is read or written by any other 
iteration of that loop. It is also all right if the same iteration reads and/or writes 
a memory location repeatedly as long as no others do; it is all right if many 
iterations read the same location, as long as none of them write to it. In a 
FORTRAN program, memory locations are represented by variable names. So, 
to determine if a particular loop can be run in parallel, you can examine the way 
variables are used in the loop. Since data dependence occurs only when memory 
locations are modified, pay particular attention to variables that appear on the 
lefthand side of assignment statements. If a variable is not modified, there is no 
data dependence associated with it. 

The FORTRAN compiler supports four kinds of variable usage within a parallel 
loop: share , LOCAL f lastlocal, and reduction. If a variable is declared as 
share, all iterations of the loop use the same copy. If a variable is declared as 
local, each iteration is given its own uninitialized copy. A variable is declared 
share if it is only read (not written) within the loop, or if it is an array where 
each iteration of the loop uses a different element of the array. A variable can be 
local if its value does not depend on any other iteration, and its value is used 
only within a single iteration. In effect the local variable is just temporary; a 
new copy can be created in each loop iteration without changing the final 
answer. As a special case, if only the very last value of a variable computed on 
the very last iteration is used outside the loop (but it would otherwise qualify as 
a local variable), the loop can be multiprocessed by declaring the variable to be 
lastlocal. The use of reduction variables is discussed later. 

It is often difficult to analyze loops for data dependence information. Each use 
of each variable must be examined to see if it fulfills the criteria for local , 
lastlocal, share or reduction. If all of the variables conform, the loop can 
be parallelized. If not, the loop cannot be parallelized as it stands, but possibly 
can be rewritten into an equivalent parallel form. (See Section 5.5, "Breaking 
Data Dependencies," for information on rewriting code in parallel form.) 
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An alternative to analyzing variable usage by hand is to use the Silicon Graphics 
POWER FORTRAN ACCELERATOR™ (PFA). This optional software 
package is a FORTRAN preprocessor that analyzes loops for data dependence. 
If it can determine that a loop is data independent, it automatically inserts the 
required compiler directives (see Section 5.2, 'Writing Parallel FORTRAN"). 
If PFA cannot determine the loop to be independent, it produces a listing file 
detailing where the problems lie. 

The rest of this section is devoted to analyzing sample loops, some parallel and 
some not parallel. 

Example 1: Simple Independence 

DO 10 I - 1,N 
10 A(I) - X + B(I)*C(I) 

In this example, each iteration writes to a different location in a, and none of the 
variables appearing on the righthand side is ever written to, only read from. This 
loop can be correctly run in parallel. All of the variables are share, except for i 
either local or lastlocal, depending on whether the last value of i is used 
later in the code. 

Example 2: Data Dependence 

DO 20 I - 2,N 
20 A(I) - B(I) - A(X-l) 

This fragment contains a (i ) on the lefthand side, and a (i-i ) on the right. 
This means that one iteration of the loop writes to a location in a, and the next 
iteration reads from that same location. Since different iterations of the loop read 
and write the same memory location, this loop cannot be run in parallel. 



o 



( 



( 
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Example 3: Stride Not 1 



do 20 i = 2,n,2 
20 a(i) - b(i) - a(i-l) 

This looks very much like the previous example. The difference is that the 
stride of the do loop is now two rather than one. Now a (i ) references every 
other element of a, and a (i-i ) references exactly those elements of a that are 
not referenced by a ( i ) . None of the data locations on the righthand side are ever 
the same as any of the data locations written to on the lefthand side. The data 
are disjoint, so there is no dependence. The loop can be run in parallel. Arrays 
a and b can be declared share, while variable i should be declared local or 

LASTLOCAL. 



Example 4: Local Variable 



DO I ■ 1, N 

X = A(I)*A(I) + B(I) 

B(I) - X + B(I)*X 
END DO 

In this loop, each iteration of the loop reads and writes the variable x. However, 
no loop iteration ever needs the value of x from any other iteration. Variable x 
is used as a temporary; its value does not survive from one iteration to the next. 
This loop can be parallelized by declaring x to be a local variable within the 
loop. Note that b ( i ) is both read and written by the loop. This is not a 
problem since each iteration has a different value for i , so each iteration uses a 
different b ( i ) . The same b ( i ) is allowed to be read and written, as long as it is 
done by the same iteration of the loop. The loop can be run in parallel. Arrays 
a and b can be declared share, while variable i should be declared local or 

LASTLOCAL. 



Example 5: Function Call 



DO 10 I = 1, N 

X - SORT (A (I) ) 
B(I) = X*C(I) + X*D(I) 
10 CONTINUE 



The value of x in any iteration of the loop is independent of the value of x in any 
other iteration, so x can be made a local variable. The loop can be run in 
parallel. Arrays a, b, c and d can be declared share, while variable i should 
be declared local or lastlocal. 
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The interesting feature of this loop is that it invokes an external routine, sqrt(3f). 

It is possible to use functions and/or subroutines (intrinsic or user-defined) 

within a parallel loop. However, make sure that the various parallel invocations 

of the routine do not interfere with one another. In particular, sqrt returns a ^ 

value that depends only on its input argument, does not modify global data, and | 

does not use static storage. We say that sqrt has no side effects. V 

All of the FORTRAN intrinsic functions listed in Appendix A of the FORTRAN 
77 Language Reference Manual have no side effects, and can safely be part of a 
parallel loop. For the most part, the FORTRAN library functions and VMS 
intrinsic subroutine extensions (listed in Chapter 4 of this manual) cannot safely 
be included in a parallel loop. In particular, rand(3f) is not safe for 
multiprocessing. For user-written routines, it is the responsibility of the user to 
ensure the routines can be correctly multiprocessed. 

Caution : Routines called within a parallel loop cannot be compiled with the 
-static flag. 

Example 6: Rewritable Data Dependence 



INDX - 
DO I = 1, N 

INDX = INDX + I 

A(I) - B(I) + C(INDX) 
END DO 

Here, the value of indx survives the loop iteration and is carried into the next 
iteration. This loop cannot be parallelized as it is written. Making indx a local 
variable does not work; you need the value of indx computed in the previous 
iteration. It is possible to rewrite this loop to make it parallel (see Example 1 in 
Section 5.5). 

Example 7: Exit Branch 



DO I - 1, N 

IF (A (I) LT. EPSILON) GOTO 320 

A (I) - A(I) * B(I) 
END DO 

320 CONTINUE 

This loop contains an exit branch; that is, under certain conditions, the flow of 
control suddenly exits the loop. The FORTRAN compiler cannot parallelize 
loops containing exit branches. 
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Example 8: Complicated Independence 

DO I - K+l, 2*K 

W(I) - W(I) + B(I,K) * W(I-K) 
END DO 

At first glance, this loop looks like it cannot be run in parallel since it uses both 
w (i ) and w (i-k) . Closer inspection reveals that since the value of i varies 
between k+l and 2*k, then i-k goes from l to k. This means that the w (i-k) 
term varies from w (l) up to w (k) 9 while the w (i) term varies from w (k+l) up 
to w (2*k) . So w (i-k) in any iteration of the loop is never the same memory 
location as w (i) in any other iterations. Since there is no data overlap, there are 
no data dependencies. This loop can be run in parallel. Elements w, b, and k 
can be declared share, while variable i should be declared local or 

LASTLOCAL. 

This example points out a general rule: the more complex the expression used to 
index an array, the harder it is to analyze. If the arrays in a loop are indexed 
only by the loop index variable, the analysis is usually straightforward although 
tedious. Fortunately, in practice most array indexing expressions are quite 
simple. 

Example 9: Inconsequential Data Dependence 

INDEX - SELECT (N) 
DO I - 1, N 

A (I) - A (INDEX) 
END DO 

There is a data dependence in this loop since it is possible that at some point i 
will be the same as index, so there will be a data location that is being read and 
written by different iterations of the loop. In this particular special case, you can 
simply ignore it. You know that when i and index are equal, the value written 
into a (i ) is exactly the same as the value that is already there. The fact that 
some iterations of the loop will read the value before it is written and some after 
it is written is not important since they will all get the same value. Therefore, 
this loop can be parallelized. Array a can be declared share, while variable i 
should be declared local or lastlocal. 



Version 2.0 FORTRAN Enhancements for Multiprocessors 5-1 5 



Example 10: Local Array 



DO I 



END 



- lr 


N 










D(l) 


- A(I, 


.1) 


- A(J, 


1) 




D(2) 


- A(I f 


2) 


- A(J, 


2) 




D(3) 


- A(I, 


3) 


- A(J # 


3) 




TOTAL 


, DISTANCE (I, J) 


-SQRT(D(1) 


**2 


DO 













+ D(2)**2 + D(3)**2) 



In this fragment, each iteration of the loop uses the same locations in the d array. 
However, closer inspection reveals that the entire d array is being used as a 
temporary. This can be multiprocessed by declaring d to be local. The 
FORTRAN compiler allows arrays (even multidimensional arrays) to be local 
variables, with one restriction: the size of the array must be known at compile 
time. The dimension bounds must be constants; the local array cannot have 
been declared using a variable or with the asterisk syntax. 

Therefore, this loop can be parallelized. Arrays total_j±Lstance and a can be 
declared share, while array d and variable i should be declared local or 

LASTLOCAL. 



( 



5.5 Breaking Data Dependencies 

Many loops that have data dependencies can be rewritten so that some or all of 
the loop can be run in parallel. The essential idea is to locate the statement(s) in 
the loop that cannot be made parallel, and try to find another way to express it 
that does not depend on any other iteration of the loop. If this fails, try to pull 
the statements out of the loop and into a separate loop, allowing the remainder of 
the original loop to be run in parallel. 

The first step is to analyze the loop to discover the data dependencies (see 

Section 5.3). Once the problem areas are identified, various techniques can be 

used to rewrite the code to break the dependence. Sometimes the dependencies 

in a loop cannot be broken and you must either accept the serial execution rate 

or try to discover a new parallel method of solving the problem. The rest of this 

section is devoted to a series of "cookbook examples" on how to deal with 

commonly occurring situations. These are by no means exhaustive, but cover g 

many situations that happen in practice. f 
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Example 1: Loop Carried Value 

INDX = 
DO I - 1, N 

INDX - INDX + I 

A(I) - B(I) + C(INDX) 
END DO 

This is the same as Example 6 in Section 5.3. Here, indx has its value carried 
from iteration to iteration. However, it is possible to compute the appropriate 
value for indx without making reference to any previous value: 

C$DOACROSS LOCAL (I , INDX) 
DO I = 1, N 

INDX - (I* (I+l))/2 
A(I) - B(I) + C(INDX) 
END DO 

In this loop, the value of indx is computed without using any values computed 
on any other iteration, indx can correctly be made a local variable, and the 
loop can now be multiprocessed. 

Example 2: Indirect Indexing 

DO 100 I = 1, N 
IX - INDEXX(I) 
IY - INDEXY(I) 

XFORCE(I) = XFORCE(I) + NEWXFORCE (IX) 
YFORCE(I) - YFORCE(I) + NEWYFORCE (IY) 
IXX = IXOFFSET(IX) 
IYY - IYOFFSET(IY) 

TOTAL (IXX, IYY) - TOTAL (IXX, IYY) + EPSILON 
100 CONTINUE 

It is the final statement that causes problems. The indices ixx and iyy are 
computed in a fairly complex way, and depend on the values from the ixof f set 
and iyof f set arrays. We do not know if total (ixx, iyy) in one iteration of 
the loop will always be different from total (ixx, iyy) in every other 
iteration of the loop. We can pull the statement out into its own separate loop by 
expanding ixx and iyy into arrays to hold intermediate values: 
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C$DOACROSS LOCAL (IX, IY, I) 
DO I - 1, N 

IX - INDEXX(I) 

IY - INDEXY(I) 

XFORCE(I) = XFORCE(I) + NEWXFORCE (IX) 

XFORCE(I) = YFORCE(I) + NEWYFORCE (IY) 

IXX(I) - IXOFFSET(IX) 

IYY(I) - lYOFFSET(IY) 
END DO 

DO 100 I - 1, N 

TOTAL (IXX (I), IYY (I)) - TOTAL (IXX (I) , IYY(I)) +EPSILON 
100 CONTINUE 

Here, ixx and iyy have been turned into arrays to hold all of the values 
computed by the first loop. The first loop (containing most of the work) can 
now be run in parallel. Only the second loop must still be run serially. 

Before we leave this example, note that if we were certain that the value for ixx 
was always different in every iteration of the loop, then the original loop could 
be run in parallel. It could also be run in parallel if iyy was always different. If 
ixx (or iyy) is always different in every iteration, then total (ixx, iyy) is 
never the same location in any iteration of the loop, and so there is no data 
conflict. 

This sort of knowledge is of course program specific and should always be used 
with great care. It may be true for a particular data set, but to run the original 
code in parallel as it stands, you need to be sure it will always be true for all 
possible input data sets. 



c 



( 



Example 3: Recurrence 

DO I - 1,N 

X(I) - X(I-l) + Y(I) 
END DO 

This is an example of what is called recurrence. Recurrence exists when a 

value computed in one iteration is immediately used by another iteration. There 

is no good way of running this loop in parallel. If this type of construct appears 

in a critical loop, try pulling the statement(s) out of the loop as in the previous 

example. Sometimes another loop encloses the recurrence; in that case, try to ^ 

parallelize the outer loop. | 
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Example 4: Sum Reduction 

SUM =0.0 
DO I - 1,N 

SUM - SUM + A (I) 
END DO 

This operation is known as a reduction. Reductions occur when an array of 
values are combined and reduced into a single value. This example is a sum 
reduction since the combining operation is addition. Here, the value of sum is 
carried from one loop iteration to the next, so this loop cannot be 
multiprocessed. However, since the purpose of this loop is simply to sum the 
elements of a (i) , we can rewrite the loop to accumulate multiple, independent 
subtotals. Then we can do much of the work in parallel: 

NUM_THREADS - MP_NUMTHREADS ( ) 
C 

C IPIECE_SIZE = N/NUM_THREADS ROUNDED UP 
C 

IPIECE_SIZE - (N + (NUM_THREADS -1)) / NUM_THREADS 

DO K = 1, NUM_THREADS 
PARTIAL_SUM(K) =0.0 
C 

C THE FIRST THREAD DOES 1 THROUGH IPIECE_SIZE, THE 
C SECOND DOES IPIECE_SIZE + 1 THROUGH 2*IPIECE_SIZE, 
C ETC. IF N IS NOT EVENLY DIVISIBLE BY NUMJTHREADS, 
C THE LAST PIECE NEEDS TO TAKE THIS IJNTO ACCOUNT, 
C HENCE THE ^MIN'' EXPRESSION. 
C 

DO I =K*IPIECE_SIZE -IPIECE_SIZE +1, MIN (K*IPIECE_SIZE, N) 
PARTIAL_SUM(K) = PARTIAL_SUM (K) + A (I) 

END DO 

END DO 
C 

C NOW ADD UP THE PARTIAL SUMS 
C 

SUM =0.0 

DO I - 1 , NUMJTHREADS 

SUM = SUM + PARTIAL_SUM(I) 

END DO 

The outer k loop can be run in parallel. In this method, the array pieces for the 
partial sums are contiguous, resulting in good cache utilization and performance. 



This is an important and common transformation, and so automatic support 
provided via the reduction clause: 



is 
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SUM =0.0 
C$DOACROSS LOCAL (I). f REDUCTION (SUM) 

DO 10 I - 1, N 

SUM - SUM + A (I) 
10 CONTINUE 

This has essentially the same meaning as the much longer and more confusing 
code above. Nevertheless, this is an important example to study because this 
idea of adding an extra dimension to an array to permit parallel computation, and 
then combining the partial results, is an important technique for trying to break 
data dependencies. This idea occurs over and over in various contexts and 
disguises. 

Note that reduction transformations like this are not strictly correct. Because 
computer arithmetic has limited precision, when you sum the values together in 
a different order as was done here, the round-off errors accumulate slightly 
differently. It is likely that the final answer will be slightly different from the 
original loop. Most of the time the difference is irrelevant, but it can be 
significant, so some caution is in order. 

One further reduction is noteworthy. 



DO I = 1, N 




TOTAL = 0.0 




DO J = 1, M 




TOTAL = 


TOTAL + A (J) 


END DO 




B(I) - C(I) 


* TOTAL 


END DO 





Initially, it may look as if the reduction in the inner loop needs to be rewritten in 
a parallel form. However, look at the outer i loop. Although total cannot be 
made a local variable in the inner loop, it fulfills the criteria for a local 
variable in the outer loop: the value of total in each iteration of the outer loop 
does not depend on the value of total in any other iteration of the outer loop. 
Thus, you do not have to rewrite the loop; you can parallelize this reduction on 
the outer i loop, making total and j local variables. 



( 



( 



( 



5-20 FORTRAN 77 Programmer's Guide IRIS-4D Series 



5.6 Work Quantum 



A certain amount of overhead is associated with multiprocessing a loop. If the 
work occurring in the loop is small, you may not recover the cost of the 
overhead; the loop may actually run slower by multiprocessing than by single 
processing. To avoid this, try to make the amount of work inside the 
multiprocessed region as large as possible. 



Example 1: Loop Interchange 



DO K - 1, N 

DO I - 1, N 

DO J - 1, N 

A(I, J) = A(I,J) + B(I,K) * C(K,J) 

END DO 
END DO 
END DO 

Here you have several choices: parallelize the j loop, or the i loop. You 
cannot parallelize the k loop because different iterations of the k loop will all 
try to read and write the same values of a ( i , j ) . Try to parallelize the outermost 
do loop possible, since it encloses the most work. In this example, that is the i 
loop. For this example, use the technique called loop interchange. Although the 
parallelizable loops are not the outermost ones, you can reorder the loops to 
make one of them outermost. Thus, loop interchange would produce: 

C$DOACROSS LOCAL (I , J, K) 
DO I - 1, N 

DO K - 1, N 

DO J ■ 1, N 

A(I,J) - A(I f J) + B(I,K) * C(K,J) 
END DO 
END DO 
END DO 

Now the parallelizable loop encloses more work, and will show better 
performance. In practice, relatively few loops can be reordered in this way. 
However, it does occasionally happen that several loops in a nest of loops are 
candidates for parallelization. In such a case, it is usually best to parallelize the 
outermost one. 
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Occasionally the only loop available to be parallelized has a fairly small amount 
of work. It may be worthwhile to force certain loops to run without parallelism, 
or to select between a parallel version and a serial version, based on the length of 
the loop. 

c 

Example 2: Conditional Parallelism y 



J = 


- (N/4) 


* 4 




DO 


I - J+l 


., N 






A(I) - 


A(I) + X*B(I) 


END DO 






DO 


I - 1, 


J, 4 






A(I) - 


A(I) + X*B(I) 




A(I+1) 


= A(I+1) 


+ X*B(I+1) 




A(I+2) 


- A(I+2) 


+ X*B(I+2) 




A(I+3) 


- A(I+3) 


+ X*B(I+3) 


END DO 







Here you are using loop unrolling of order four to improve speed. For the first 

loop, the number of iterations will always be less than four, so this loop does not 

do enough work to justify running it in parallel. The second loop is worthwhile 

to parallelize if n is big enough. To overcome the parallel loop overhead, n 

needs to be around 50. An optimized version would use the if clause on the £ 

doacross directive: ^ 



J - (N/4) * 4 




DO I == J+l, N 




A(I) - A(I) + X*B(I) 


END DO 




C$DOACROSS IF (J.GE. 


50), LOCAL (I) 


DO I = 1, 


J, 4 


A(I) - 


A(I) + X*B(I) 


A(I + 1) 


- A(I+1) + X*B(I+1) 


A(I+2) 


- A(I+2) + X*B(I+2) 


A(I+3) 


- A (1+3) + X*B(I+3) 


END DO 




ENDIF 





( 
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5.7 Cache Effects 

It is good policy to write loops that take the effect of the cache into account, 
with or without parallelism. The technique for the best cache performance is also 
quite simple: make the loop step through the array in the same way that the array 
is laid out in memory. For FORTRAN, this means stepping through the array 
without any gaps^ and with the leftmost subscript varying the fastest. 

Note that this optimization does not depend on multiprocessing, nor is it 
required in order for multiprocessing to work correctly. However, 
multiprocessing can impact how the cache is used, so it is worthwhile to 
understand. 

5.7.1 Example 1 : Matrix Multiply 

DO I - 1, N 

DO K - 1, N 

DO J = 1, N 

A(I,J) = A(I, J) + B(I,K) * C(K,J) 
END DO 
END DO 
END DO 

This is the same as Example 1 in Section 5.6. To get the best cache 
performance, the i loop should be innermost. At the same time, to get the best 
multiprocessing performance, the outermost loop should be parallelized. For 
this example, you can interchange the i and j loops, and get the best of both 
optimizations: 
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C$DOACROSS LOCAL (I, J, K) 
DO J - 1, N 

DO K - 1, N 

DO I - l f N 

A(I,J) - A(I, J) + B(I,K) * C(K,J) 
END DO 
END DO 
END DO 

5-7.2 Example 2: Tradeoffs 

Sometimes, you must choose between the possible optimizations and their costs. 
Look at: 

DO J - 1, N 

DO I - 1, M 

A(I) - A(I) + B(J)*C(I,J) 

END DO 
END DO 

This loop can be parallelized on i but not on j . You could interchange the loops 
to put i on the outside, thus getting a bigger work quantum. 

C$DOACROSS LOCAL (I, J) 
DO I - 1, M 

DO J - 1, N 

A(I) - A(I) + B(J)*C(I,J) 
END DO 
END DO 

However, putting j on the inside means that you will step through the c array in 
the wrong direction: the leftmost subscript should be the one that varies the 
fastest. It is possible to parallelize the i loop where it stands, 

DO J - 1, N ■ 
C$DOACROSS LOCAL (I) 

DO I - 1, M 

A(I) - A(I) + B(J)*C(I,J) 
END DO 
END DO 

but m needs to be large for the work quantum to show any improvement. In this 
particular example, a (i ) is used to do a sum reduction, and it is possible to use 
the reduction techniques shown in Section 5.5, Example 4, to rewrite this in a 
parallel form. (Recall that there is no support for an entire array as a member of 
the reduction clause on a doacross.) However, that involves converting the 
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array a from a one-dimensional array into a two-dimensional array to hold the 
partial sums; this is analogous to the way we converted the scalar summation 
variable into an array of partial sums. If a is large, however, that may take more 
memory than you can spare. 

NUM - MP_NUMTHREADS() 

IPIECE = (N + (NUM-1)) / NUM 

C$DOACROSS LOCAL (K, J, I) 
DO K = 1, NUM 

DO J = K*IPIECE - IPIECE + 1, MIN(N, K*IPIECE) 
DO I = 1, M 

PARTIAL_A(I,K) - PARTIAL_A (I ,K) + B(J)*C(I,J) 
END DO 
END DO 
END DO 

C$DOACROSS LOCAL (I,K) 
DO I - 1, M 

DO K - 1, NUM 

A(I) = A(I) + PARTIAL_A(I,K) 
END DO 
END DO 

You must trade off the various possible optimizations to find the combination 
that is right for the particular job. 

5.7.3 Load Balancing 

When the FORTRAN compiler divides a loop into pieces, by default it uses the 
simple method of separating the iterations to contiguous blocks of equal size for 
each of the processes. It can happen that some iterations take significantly 
longer to complete than other iterations. At the end of a parallel region, the 
program waits for all processes to complete their tasks. If the work is not 
divided evenly, time is wasted waiting for the slowest process to finish. 
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Example: 

DO I = 1, N 

DO J - 1, I f 

A(J, I) = A(J, I) + B(J)*C(I) I 

END DO ^ 

END DO 

This can be parallelized on the i loop. Since the inner loop goes from 1 to i , 
the first block of iterations of the outer loop will end long before the last block 
of iterations of the outer loop. In this example, this is easy to see and 
predictable, so you can change the program: 

NUMJTHREADS = MP_NUMTHREADS ( ) 
C$DOACROSS LOCAL (I, J, K) 

DO K - 1, NUMJTHREADS 

DO I = K, N, NUMJTHREADS 
DO J - 1, I 

A(J, I) - A(J, I) +B(J)*C(I) 
END DO 
END DO 
END DO 

In this rewritten version, instead of breaking up the i loop into contiguous | 

blocks, you break it up into interleaved blocks. Thus, each of the execution ^ 

threads receives some small values of i and some large values of i , giving a 
better balance of work between the threads. Interleaving usually, but not 
always, helps cure a load balancing problem. 

This particular transformation is quite common and desirable, and so support is 
provided to do this automatically by using the mp_schedtype clause. 

C$DOACROSS LOCAL (I, J), MP_SCHEDTYPE=INTERLEAVE 
DO 20 I - 1, N 

DO 10 J = 1, I 

A (J, I) - A(J,I) + B(J)*C(J) 
10 CONTINUE 

20 CONTINUE 

This has the same meaning as the rewritten form above. 

Note that this can cause poor cache performance since you are no longer / 

stepping through the array at stride 1. This can be somewhat improved by V 

adding a chunk clause. chunk = 4 or 8 is often a good choice of value. Each 
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small chunk will have stride 1 to improve cache performance, while the chunks 
are interleaved to improve load balancing. 

The way that iterations are assigned to processes is known as "scheduling." 
Interleaving is one possible schedule. Both interleaving and the "simple" 
scheduling methods are examples of fixed schedules: the iterations are assigned 
to processes by a single decision made when the loop is entered. For more 
complex loops, it may be desirable to use dynamic or gss schedules. 
Comparing the output from pixie or from pc-sample profiling allows you to see 
how well the load is being balanced, so you can compare the different methods 
of dividing the load. Refer to the discussion of the mp_schedtype clause in 
Section 5.3.1 for more information. 

Even when the load is perfectly balanced, iterations may still take different 
amounts of time to finish because of random factors. One process may have to 
read the disk, another may be interrupted to let a different program run, etc. 
Because of these unpredictable events, the time spent waiting for all processes to 
complete may be several hundred cycles, even with near perfect balance. 
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5.8 Advanced Features 

A number of features are provided for sophisticated users to override the 
multiprocessing defaults provided by the FORTRAN system, and to customize 
the parallelism to their particular applications. This section provides a brief 
explanation of these features; additional information can be found in Section 3 
of the FORTRAN 77 Reference Manual Pages. 

5.8.1 mp_h!ock and mp_unblock 

mp_block(3f) puts the slave threads into a blocked state using the system call 
blockproc(2). The slave threads stay blocked until a call is made to 
mp_unblock(3f). These routines are useful if the job has bursts of parallelism 
separated by long stretches of single processing, as with an interactive program. 
You can block the slave processes so they consume CPU cycles only as needed, 
thus freeing the machine for other users. The FORTRAN system automatically 
unblocks the slaves upon entering a parallel region, should you neglect to do so. 

5.8.2 mp_setup, mp_create and mp_destroy 

The mp_setup(3f), mp_create(3f) and mp_destroy(3f) subroutine calls create 
and destroy threads of execution. This can be useful if the job has only one 
parallel portion or if the parallel parts are widely scattered. When you destroy 
the extra execution threads, they cannot consume system resources; they must 
be re-created when needed. Use of these routines is discouraged because they 
degrade performance; the mpjblock and mpjinblock routines can be used 
instead in almost all cases. 

mp_setup takes no arguments. It creates the default number of processes as 
defined by previous calls to mpjetjiumthreads, by the environment variable 
mp_set__numthreads , or by the number of cpus on the current hardware 
platform, mpjetup is called automatically when the first parallel loop is entered 
in order to initialize the slave threads. 

mp_create takes a single integer argument, the total number of execution threads 
desired. Note that the total number of threads includes the master thread. Thus, ^ 

mp_create(n) creates one thread less than the value of its argument. mp_destroy I 

takes no arguments; it destroys all the slave execution threads, leaving the 
master untouched. 
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When the slave threads die, they generate a sigcld signal. If your program has 
changed the signal handler to catch sigcld, it must be prepared to deal with this 
signal when mp_destroy is executed. This signal also occurs when the program 
exits; mpjlestroy is called as part of normal cleanup when a parallel 
FORTRAN job terminates. 

5.8-3 mpjblocktime 

The FORTRAN slave threads spin wait until there is work to do. This makes 
them immediately available when a parallel region is reached. However, this 
consumes CPU resources. After enough wait time has passed, the slaves block 
themselves via blockproc. Once the slaves are blocked, it requires a system call 
to unblockproc(2) to activate the slaves again. This makes the response time 
much longer when starting up a parallel region. 

This tradeoff between response time and CPU usage can be adjusted with the 
mp_blocktime(3f) call. rnp_blocktime takes a single integer argument that 
specifies the number of times to spin before blocking. By default, it is set to 
10,000,000; this takes roughly 3 seconds. If called with an argument of 0, the 
slave threads will not block themselves no matter how much time has passed. 
Explicit calls to mpjblock, however, will still block the threads. 

This automatic blocking is transparent to the user's program; blocked threads are 
automatically unblocked when a parallel region is reached. 

5.8.4 mp_numthreads, mp__set_numthreads 

Occasionally, you may want to know how many execution threads are available. 
mp_nwnthreods(3f) is a zero argument integer function that returns the total 
number of execution threads for this job. The count includes the master thread. 

mp_set_numthreads(3f) takes a single integer argument. It changes the default 
number of threads to the specified value. A subsequent call to mp_setup will use 
the specified value rather than the original defaults. If the slave threads have 
already been created, this call will not change their number. It only has an effect 
when mp_setup is called. 
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5.8-5 mp_myjhreadnum 

mpjnyjhreadnwn(3f) is a zero argument function that allows a thread to 

differentiate itself while in a parallel region. If there are n execution threads, the £ 

function call returns a value between zero and n - 1. The master thread is always \ 

thread zero. This function can be useful when parallelizing certain kinds of 

loops. Most of the time, the loop index variable can be used for the same 

purpose. Occasionally the loop index may not be accessible, as, for example, 

when an external routine is called from within the parallel loop. This routine 

provides a mechanism for those rare cases. 

5.8.6 Environment Variables: 

MP_SETJ^UMTHREADS, MPJ3LOCKTIME, 
MP„SETUP 

These environment variables act like an implicit call to the corresponding 
routine(s) of the same name at program startup time. For example, the csh 
command 



setenv MP_SET__NUMTHREADS 2 

causes the program to create 2 threads regardless of the number of cpus actually 
on the machine, just like a source statement "call mp__set_numthreads ( 2 ) ." 
Similarly the sh commands 

set MPJBLOCKTIME 
export MPJBLOCKTIME 

will prevent the slave threads from autoblocking, just like a source statement 

"call mp__blocktime (0)". 

For compatibility with older releases, the environment variable numjthreads is 
supported as a synonym for mp_set__numthreads. 



( 



( 
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5.8.7 Environment Variables: MP_SCHEDTYPE, 
CHUNK 

These environment variables specify the type of scheduling to use on doacross 
loops that have their scheduling type set to runtime. For example, the csh 
commands 

setenv MP_SCHEDTYPE INTERLEAVE 
setenv CHUNK 4 

will cause loops with the runtime scheduling type to be executed as interleaved 
loops with a chunk size of 4. The defaults are the same as on the doacross 
directive: if neither variable is set, simple scheduling is assumed. If 
mp_schedtype is set, but chunk is not set, a chunk of 1 is assumed. If chunk is 
set, but mp_schedtype is not, dynamic scheduling is assumed. 

5.8.8 Environment Variable: MP PROFILE 



By default, the multiprocessing routines use the fastest possible method of doing 
their job. This can make it difficult to determine where the time is being spent if 
the multiprocessing routines themselves seem to be a bottleneck. By setting the 
environment variable mp_profile, the multiprocessing routines will use a 
slightly slower method of synchronization, where each step in the process is 
done in a separate subroutine with a long, descriptive name. Thus pixie or pc- 
sample profiling can get more complete information regarding how much time is 
spent inside the multiprocessing routines. 

Note that it is only set/unset that is important. The value the variable is set to is 
irrelevant (and will typically be null). 

5.8.9 mp__setlock, mp__unsetlock, mp__barrier 

These zero argument functions provide convenient (although limited) access to 
the locking and barrier functions provided by ussetlock(3p), usunsetlockQv), 
and barrier(3p). The convenience is that no user initialization need by done, 
since the usconfig(3p), usinit{3^), etc. calls are done automatically. The 
limitation is that there is only one lock, and one barrier. For a great many 
programs, this is sufficient. Users needing more complex or flexible locking 
facilities should uset the ussetlock family of routines directly. 
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5.8.10 Local COMMON Blocks 

A special W(l) option allows named common blocks to be local to a process. 
This means that each process in the parallel job gets its own private copy of the 
common block. This can be helpful in converting certain types of FORTRAN 
programs into a parallel form. 

The common block must be a named common (blank common may not be made 
local) and it must not be initialized via data statements. 

To create a local common block, give the special loader directive 
-Xlocaldata followed by a list of common block names. Note that the external 
name of a common block known to the loader has a trailing underscore, and is not 
surrounded by slashes. For example, the command 

f77 -mp a.o -Xlocaldata foo_ 

would make the common block /f 00/ be a local common block in the resulting 
a.out file. 

It is occasionally desirable to be able to copy values from the master thread's 

version of the common block into the slave thread's version. The special directive £ 

c$copyin allows this. It has the form |^ 

C$COPYIN item [, item ...] 

Each item must be a member of a local common block. It may be a variable, an 
array, an individual element of an array, or the entire common block. For 
example: 

C$COPYIN x,y, /foe/, a(i) 

will propagate the values for x and y, all the values in the common block f 00, and 
the i th element of array a. All of these items must be members of local common 
blocks. Note that this directive is translated into executable code, so in this 
example i will be evaluated at the time this statement is executed. 



( 
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5.8.11 Compatibility with sproc 

The parallelism used in FORTRAN is implemented using the standard system 
call $proc(2). It is recommended that programs not attempt to use both 
c$doacross loops and sproc calls. It is possible, but there are several 
restrictions. 

Any threads you create may not execute c$doacross loops; only the 
original thread is allowed to do this. 

• The calls to mpjblock, mp_destroy, etc., apply only to the threads created 
via mp_create or to those automatically created when the FORTRAN job 
starts; they have no effect on any user-defined threads. 

• Calls to routines such as mj>etjtwnprocs(3p) do not apply to the threads 
created by the FORTRAN routines. However, the FORTRAN threads are 
ordinary subprocesses; using the routine fa7/(2) with the arguments and sig 
(kill (0 , sig) ) to signal all members of the process group might possibly 
result in the death of the threads used to execute c$doacross. 

• If you choose to intercept the sigcld signal, you must be prepared to 
receive these signal(s) when the threads used for the c$doacross loops exit; 
this occurs when mpjdestroy is called, or at program termination. 

Note in particular that mJork(5\)) is implemented using sproc, so it is not 
legal to mjork a family of processes that will each subsequently execute 
c$doacross loops. Only the original thread may execute C$doacross 
loops. 
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5.9 DOACROSS Implementation 

This section discusses how multiprocessing is implemented in a doacross \ 

routine. This information is useful when you use the debugger and interpret the 
results of an execution profile. 

5.9-1 Loop Transformation 

When the FORTRAN compiler encounters a c$doacross statement, it spools 
the corresponding do loop into a separate subroutine. The compiler then 
replaces the loop statement with a call to a special library routine. Exactly 
which routine is called depends on the value of mp_schedtype. For discussion 
purposes we will assume simple scheduling, so the library routine is 
mp_simple_sched. 

The newly created subroutine is named by prepending an underscore to the 

original routine name, and appending a number. Thus, in a routine named f oo 

with multiple loops, the first c$doacross loop is spooled into a subroutine 

named _f oo_i, the second c$doacross loop is spooled into a subroutine named £ 

_f oo_2, and so on. Any variables declared to be local in the original |^ 

c$doacross statement are declared as local variables in the spooled routine. 

References to share variables are resolved by referring back to the original 

routine. 

Since the spooled routine is now just a do loop, the mp_simple_sched routine 
specifies, through subroutine arguments, which part of the loop a particular 
process is to execute. The spooled routine has four arguments: the starting value 
for the index, the number of times to execute the loop, the amount to increment 
the index, and a special flag word. 



( 
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As an example, the following routine: 

SUBROUTINE EXAMPLE (A, B, C, N) 
REAL A(*) , B(*) , C(*) 

C$DOACROSS LOCAL (I, X) 
DO I - 1, N 

X - A(I)*B(I) 

C(I) - X + X**2 
END DO 

C(N) = A(l) + B(2) 

RETURN 

END 

produces this spooled routine to represent the loop: 

SUBROUTINE _EXAMPLE_1 
X ( _LOCAL_START, _LOCAL_NTRIP, __INCR, JTHREADINFO) 
INTEGER* 4 _LOCAL_START 
INTEGER* 4 _LOCAL_NTRIP 
INTEGER* 4 _INCR 
INTEGER*4 _THREADINFO 
INTEGER* 4 I 
REAL X 
INTEGER* 4 _DUMMY 

I = _LOCAL_START 

DO _DUMMY = l,_LOCAL_NTRIP 

X - A(I)*B(I) 

C(I) = X + X**2 
1 = 1 + 1 
END DO 

END 

Note that the compiler does not accept user code with an underscore ( _ ) as the 
first letter of a variable name. 
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5.9.2 Executing Spooled Routines 

The set of processes that cooperate to execute the parallel FORTRAN job are 
members of a process share group created via the system call sproc. The 
process share group is created by special FORTRAN startup routines that are 
used only when the executable is linked with the -mp option, which enables 
multiprocessing. 

The first process is the master process. It executes all the nonparallel portions of 
the code. The other processes are slave processes; they are controlled by the 
routine mp_slavejcontrol. When they are inactive they wait in the special 
routine mp_slave_waitJor_work. 

When the master process calls mp_simple_sched, the master passes the name of 

the spooled routine, the starting value of the do loop index, the number of times 

the loop is to be executed, and the loop index increment. The mp__simple_sched 

routine divides the work and signals the slaves. The master process then calls 

the spooled routine to do its work. When a slave is signaled, it wakes up from 

the wait loop, calculates which iteration(s) of the spooled do loop it is to 

execute, and then calls the spooled routine with the appropriate arguments. 

When a slave completes its execution of the spooled routine, it reports that it has 

finished, and returns to _jnpjlavejwaitJor_work. g 

When the master completes its execution of the spooled routine, it returns to 
mp_simplejched, then waits until all the slaves have completed processing. 
The master then returns to the main routine and continues execution. 

Turn to Chapter 6 for an example of debugger output for the stack trace 
command where, which shows the calling sequence. 



( 
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6. Compiling and Debugging Parallel 
FORTRAN 



6.1 Overview 

This chapter gives instructions on how to compile and debug a parallel 
FORTRAN program. Sections include: 

• Compiling and Running 

• Profiling a Parallel FORTRAN Program 

• Debugging Parallel FORTRAN 

• Parallel Programming Exercise 

This chapter assumes you have read Chapter 5, "FORTRAN Enhancements for 
Multiprocessors," and have reviewed the techniques and vocabulary for parallel 
processing in the IRIX environment. 



Version 2.0 Compiling and Debugging Parallel FORTRAN 6-1 



6.2 Compiling and Running 

After you have written a program for parallel processing, the next step is to 
debug your program in a single processor environment. Do this by calling the 
FORTRAN compiler with the/77 command. After your program has executed 
successfully on a single processor, you can compile it for multiprocessing. 
Check the man pages for/77(l) options for multiprocessing. 

To turn on multiprocessing, add -mp to the f77 command line. This causes the 
FORTRAN compiler to generate multiprocessing code for the particular file(s) 
being compiled. When linking, it is legitimate to mix together object files 
produced with the -mp flag and object files produced without it. If any or all of 
the files are compiled with -mp, the executable must be linked with -mp so that 
the correct libraries are used. 

6.2.1 Using the -static Flag 

A few words of caution about the -static flag: The multiprocessing 
implementation demands some use of the stack to allow multiple threads of 
execution to simultaneously execute the same code. Therefore, the parallel do 
loops themselves are compiled with the -automatic flag, even if the routine 
enclosing them is compiled with -static. 

This means that share variables in a parallel loop behave correctly according to 
the -static semantics, but local variables in a parallel loop will not (see Section 
5.3 for a description of share and local variables). 

Finally, if the parallel loop calls an external routine, that external cannot be 
compiled with -static. You can mix static and multiprocessed object files in the 
same executable; the restriction is that a static routine cannot be called from 
within a parallel loop. 
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6.2.2 Examples of Compiling 

This section steps you through a few examples of compiling code using -mp. 
The following command line: 

£77 -rap foo.f 

compiles and links the FORTRAN program /oa/into a multiprocessor 
executable. 

In this example 

£77 -c -mp -02 snark.f 

the FORTRAN routine(s) in the file snark.f art compiled with multiprocess code 
generation enabled. The optimizer is also used. A standard snark.o binary is 
produced, which must be linked: 

£77 -mp -o boo jura snark.o bellman. o 

Here, the -mp flag signals the linker to use the FORTRAN multiprocessing 
library. The file bellman.o need not have been compiled with the -mp flag 
(although it could have been). 

After linking, the resulting executable can be run like any standard executable. 
The creation of multiple execution threads, running and synchronizing them, and 
task termination are all handled automatically. 

When an executable has been linked with -mp, the FORTRAN initialization 
routines determine how many parallel threads of execution to create. This 
determination occurs each time the task starts; the number of threads is not 
compiled into the code. The default is to use the number of processors that are 
on the machine (the value returned by the system call sysrnp(MPJNAPROCS), 
see sysmp(2)). The default can be overridden by setting the shell environment 
variable mp_set_numthreads. If it is set, FORTRAN tasks will use the 
specified number of execution threads regardless of the number of processors 
physically present on the machine. mp_set_numthreads can be an integer from 
1 to 16. 
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6.3 Profiling a Parallel FORTRAN Program 

After converting a program, you need to examine execution profiles to judge the 
effectiveness of the transformation. Good execution profiles of the program are 
crucial to help focus your effort on the loops consuming the most time. 

IRIX provides profiling tools that can be used on FORTRAN parallel programs. 
Both pixie(l) and pc-sample profiling can be used. On jobs that use multiple 
threads, both of these methods will create multiple profile data files, one for 
each thread. The standard profile analyzer prof{\) can be used to examine this 
output. 

The profile of a FORTRAN parallel job is different from a standard profile. As 
mentioned in Section 5.4.1, to produce a parallel program, the compiler pulls the 
parallel do loops out into separate subroutines, one routine for each loop. Each 
of these loops is shown as a separate procedure in the profile. Comparing the 
amount of time spent in each loop by the various threads shows how well the 
workload is balanced. 

In addition to the loops, the profile shows the special routines that actually do 
the multiprocessing. The mp_simple_sched routine is the synchronizer and 
controller. Slave threads wait for work in the routine mp_slave_wait_for_work. 
The less time they wait, the more time they work. This gives a rough estimate of 
how parallel the program is. 

Section 6.5, "Parallel Programming Exercise," contains several examples of 
profiling output, and how to use the information it provides. 
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6.4 Debugging Parallel FORTRAN 

This section presents some standard techniques to assist in debugging a parallel 
program. 

6.4.1 General Debugging Hints 

• Debugging a multiprocessed program is much harder than debugging a 
single-processor program. For this reason, do as much of the debugging as 
possible on the single-processor version. 

Try to isolate the problem as much as possible. Ideally, try to reduce the 
problem to a single c$doacross loop. 

Before debugging a multiprocessed program, change the order of the 
iterations on the parallel do loop on a single-processor version. If the loop 
can be multiprocessed, then the iterations can execute in any order and 
produce the same answer. If the loop cannot be multiprocessed, changing 
the order frequently causes the single-processor version to fail, and standard 
single-process debugging techniques can be used to find the problem. 

Once you have narrowed the bug to a single file, use -g -mp_keep to save 
debugging information and save the file containing the multiprocessed do 
loop FORTRAN code that has been moved to a subroutine, -mpjteep will 
store the compiler-generated subroutines in a file name 

$TMPDIR/P<user_subroutine_name>_<machine_name><pid>. If 
$tmpdir is not set, Itmp is used. 

Example: Erroneous C$DOACROSS 

In this example, the bug is that the two references to a have the indices in 
reverse order. If the indices were in the same order (if both were a (i , j ) or both 
were a ( j , i ) ) the loop could be multiprocessed. As written, there is a data 
dependency, so the c$doacross is a mistake. 
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C$DOACROSS LOCAL (I, J) 
DO I - 1, N 

DO J - 1, N 

A(I,J) - A(J,I) + X*B(I) 
END DO 
END DO 

Since a (correct) multiprocessed loop can execute its iterations in any order, you 
could rewrite this as: 

C$DOACROSS LOCAL (I, J) 

DO I - N, 1, -1 

DO J - 1, N 

A(I,J) - A(J,I) + X*B(I) 

END DO 
END DO 

This loop no longer gives the same answer as the original even when compiled 
without the -mp flag. This reduces the problem to a normal debugging 
problem. 

• Check the local variables when the code does run correctly as a single 
process but fails when multiprocessed. Carefully check any scalar variables 
that appear in the left-hand side of an assignment statement in the loop to be 
sure they are all declared local. Be sure to include the index of any loop 
nested inside the parallel loop. 

A related problem occurs when you need the final value of a variable, but 
the variable is declared local rather than lastlocal. If the use of the final 
value happens several hundred lines farther down, or if the variable is in a 
common block and the final value is used in a completely separate routine, it 
is quite possible for a variable to look like it is local when in fact it should 
be lastlocal. To combat this, simply declare all the local variables to be 
lastlocal when debugging a loop. 

• Check for equivalence problems. Two variables of different names may 
in fact refer to the same storage location if they are associated through an 

EQUIVALENCE. 

Check for the use of uninitialized variables. Some programs assume 
uninitialized variables have the value 0. This works with the -static flag, 
but without it, uninitialized values assume the value left on the stack. When 
compiling with -mp, the program executes differently and the stack 
contents are different. You should suspect this type of problem when a 
program compiled with -mp and run on a single processor gives a different 
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result from when it is compiled without -mp. One way to track down a 
problem of this type is to compile suspected routines with -static. If an 
uninitialized variable is the problem, it should be fixed by initializing the 
variable rather than continuing to compile -static. 

Try compiling with the -C option for range checking on array references. 
If arrays are indexed out of bounds, a memory location may be referenced 
in unexpected ways. This is particularly true of adjacent arrays in a common 
block. 

• If the analysis of the loop was incorrect, one (or more) arrays that are share 
may have data dependencies. This sort of error is seen only when running 
multiprocessed. When stepping through the code in the debugger, the 
program executes correctly. In fact, this sort of error often is seen only 
intermittently, with the program working correctly most of the time. 

The most likely candidates for this error are arrays with complicated 
subscripts. If the array subscripts are simply the index variables of a do 
loop, the analysis is probably correct. If the subscripts are more involved, 
they are a good choice to examine first. 

If you suspect this type of error, as a final resort print out all the values of 
all the subscripts on each iteration through the loop. Then use uniq(l) to 
look for duplicates. If duplicates are found, then there is a data dependency. 

6.4.2 Multiprocess Debugging Session 

This section steps through the debugging of the following incorrectly 
multiprocessed code. 



Example 



subroutine total (n, m, iold, inew) 

implicit none 

integer n, m 

integer iold(n,m), inew(n,m) 

double precision aggregate (100, 100) 
common /work/ aggregate 

integer i, j, num, ii, jj 
double precision tmp 
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c$doacross local (i, ii, j, jj f num) 
do j - 2, m-1 

do i * 2, n-1 

num » 1 

if (iold(i,j) .eq. 0) then 

inew (i, j) — 1 a 

else V 

num= iold(i-l,j) + iold(i,j-l) + iold(i-l, j-1) + 
& iold(i+l, j) + iold(i,j+l) + iold(i+l, j+1) 

if (num .ge. 2) then 

inew(i,j) - iold(i f j) + 1 
else 

inew(i, j) » max(iold(i, j)-l, 0) 
end if 
end if 



aggregate (ii, jj) + inew(i,j) 



ii ■■ 


i/10 + 


1 


jj = 


j/10 + 


1 


aggregate (ii 


- r 


end 


do 




end do 






return 






end 







In the program, the local variables are properly declared. The inew always 
appears with j as its second index, so it can be a share variable when 
multiprocessing the j loop. The iold, m, and n are only read (not written), so 
they are safe. The problem is with aggregate. The person analyzing this code 
reasoned that because j is always different in each iteration, j /l o will also be 
different. Unfortunately, since j /l o uses integer division, it often gives the 
same results for different values of j . 

While this is a fairly simple error, it is not easy to see. When run on a single 
processor, the program always gets the right answer. Some of the time it gets 
the right answer when multiprocessing. The error occurs only when different 
processes attempt to load from and/or store into the same location in the 
aggregate array at exactly the same time. 
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After reviewing the debugging hints from the previous section, first try reversing 
the order of the iterations. Replace: 

do j - 2, m-1 

with 

do j = m-1, 2, -1 

This still gives the right answer when running with one process, and the wrong 
answer when running with multiple processes. 

The local variables look right, there are no equivalence statements, and 
inew uses only very simple indexing. The likely item to check is aggregate. 
The next step is to use the debugger. 

First compile the program with the -g -mp Jkeep options: 

% £77 -g -rap -rap_Jce©p driver. f total. f -o total. *x 

driver. f: 
total. f: 

This debug session is being run on a single-processor machine, so force the 
creation of multiple threads: 

% setenv MP_SET_NUMTHREADS 2 

Start the debugger 

% dbx total. ©x 

dbx version 1.31 

Copyright 1987 Silicon Graphics Inc. 

Copyright 1987 MIPS Computer Systems Inc. 

Type 'help* for help. 

Reading symbolic information of "total. ex' . . . 

MAIN: 14 14 do i = 1, isize 

Tell dbx to pause when sproc is called: 
(dbx) set $promptonfork=l 
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Start the job: 

(dbx) run 

Warning: MP_SET_NUMTHREADS greater than available cpus 

(MPjSETJflUMTHREADS - 2; cpus - 1) 
Process 19324 (total .ex) started 
Process 19324 (total .ex) has executed the "sproc" system call 

Add child to process pool (n if no) ? y 

Reading symbolic information of Process 19325 . . . 

Process 19325 (total .ex) added to pool 

Process 19324 (total .ex) after sproc [sproc. sproc: 38, 0x41el30] 

Source (of sproc. s) not available for process 19324 

Make each process stop at the first multiprocessed loop in the routine total. 
Its name will be _total_l (see Section 5.9.1), so enter: 
(dbx) stop in __total__l pgrp 

[2] stop in _total_l 
[3] stop in __total_l 

Start them all off and wait for one of them to hit a breakpoint: 

(dbx) resuma pgrp 

(dbx) waitall 

Process 19325 (total .ex) breakpoint /trace 
trap[__total_l:16,0x4006d0] 
16 j = _local__start 

(dbx) showproc 

Process 19324 (total .ex) breakpoint/trace 
trap[_total__l:16,0x4006d0] 
Process 19325 (total .ex) breakpoint/trace 
trap[__totalj.:16,0x4006d0] 
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Look at the complete listing of the multiprocessed loop routine: 



(dbx) liat 1,50 



1 






2 






3 




subroutine __total_JL 


4 




x ( __local_start, __local_ntrip, __incr, 


my threadno) 


5 




integer* 4 _local__start 


6 




integer* 4 _local__ntrip 


7 




integer*4 _incr 


8 




integer* 4 __my_threadno 


9 




integer*4 i 


10 




integer*4 ii 


11 




integer*4 j 


12 




integer*4 jj 


13 




integer*4 num 


14 




integer* 4 _dummy 


15 






>* 16 




j = _local_start 


17 




do _dummy = 1 ,_local__ntrip 


18 




do i = 2, n-1 


19 






20 




num =1 


21 




if (iold(i, j) .eq. 0) then 


22 




inew (i, j) = 1 


More (n 


. if 


no) ?y 


23 




else 


24 




num = iold(i-l, j) + iold(i f j-1) 4- iold(i- 


l,j-D 


+ 




25 




$ iold(i+l,j) + 


iold(i, 


J+D 


+ iold(i+l, j+1) 


26 




if (num .ge. 2) then 


27 




inew(i,j) = iold(i f j) + 1 


28 




else 


29 




inew(i,j) = max (iold(i, j) -1, 0) 


30 




end if 


31 




end if 


32 






33 




ii = i/10 + 1 


34 




jj = j/10 + 1 


35 






36 




aggregate (ii, jj) » aggregate (ii, jj) + 


inew (i, 


J) 




37 






38 




end do 


39 




j - j + 1 


40 




end do 
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41 

42 end 

To look at aggregate, stop at that line with: 
(dbx) stop at 36 pgrp 

[4] stop at "/tmp/Ptotalkea_11561_":36 
[5] stop at "/tmp/Ptotalkea__11561":36 

Continue the current process (the master process). Note that eont continues 
only the current process; other members of the process group (pgrp) are 
unaffected. 

(dbx) cent 

[4] Process 19324 (total .ex) stopped at [_total__l :36, 0x400974] 
36 aggregate (ii, jj) « aggregate (ii, jj) + inew(i,j) 

(dbx) \f8showproc 

Process 19324 (total. ex) breakpoint /trace 
trap [_total_l : 36, 0x400974] 
Process 19325 (total. ex) breakpoint /trace 
trap [_total_l ': 16, 0x4006d0] 

Check the Slave 

Look at the slave process with: 

(dbx) active 19325 

Process 19325 (total. ex) breakpoint /trace 
trap [_total_l : 16 , 0x4006d0] 
(dbx) cont 
[5] Process 19325 (total. ex) stopped at [_total_l :36, 0x400974] 

36 aggregate (ii, jj) = aggregate (ii, jj) + inew(i,j) 
(dbx) where 
> _total__l (_local___start - 6, _local_jitrip = 4, __incr = 1, 

my_threadno - 1) ["/tmp/Ptotalkea_JL1561" :36, 
0x400974] 

1 mp__slave__sync (0x0, 0x0, 0x1,0x1,0x0,0x0) ["mp__slave. s" :119, 
0x402964] 

Note that the slave process has entered the multiprocessed routine from the slave 
synchronization routin mpjlavejync. Both processes are now at the 
aggregate assignment statement. Look at the values of the indices in both 
processes: 



( 



( 
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(dbx) print ii 

1 

(dbx) print jj 

1 

(dbx) print ii pid 19324 

1 

(dbx) print jj pid 19324 

1 

The indices are the same in both processes. Now examine the arguments to the 
multiprocessed routine; note that this information can also be seen in the where 
command above. 

(dbx) print _local_ntrip 

4 

(dbx) print __local_start 

6 

(dbx) print j 

6 

(dbx) print _local__ntrip pid 19324 

4 

(dbx) print _local__start pid 19324 

2 

(dbx) print j pid 19324 

2 
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The analysis for this loop assumed that j / 1 o would be different for each loop 
iteration. This is the problem; confirm it by looking further into the loop: 

(dbx) active 19324 

Process 19324 (total.ex) breakpoint/trace 
trap[_total_l:36, 0x400974] 
(dbx) where 
> _total_l (__local__start = 2, _local_jntrip = 4, _incr - 1, 

_my__threadno = 
0) ["/tmp/Ptotalkea_11561":36, 0x400974] 

1 mp_simple_sched_(0x0, 0x0, 0x0, 0x0, 0x0, 0x40034c) 
[0x400e38] 

2 total. total (n = 100, m = 10, iold = (...), inew = (...)) 

["total. f":15, 
0x4005f4] 

3 MAIN() ["driver. f":25, 0x400348] 

4 main. main (0x0, 0x7fffc7a4, 0x7fffc7ac, 0x0, 0x0, 0x0) 

["main.c":35, 
0x400afc] 

(dbx) fxmc total 

[using total. total] 

total: 15 15 do j = 2, m-1 

(dbx) print m 
10 

(dbx) quit 

Process 19324 (total .ex) terminated 
Process 19325 (total .ex) terminated 
% 

There are several possible ways to correct the problem; they are left as an 
exercise for the reader. 



( 



( 



( 
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6.5 Parallel Programming Exercise 

This section steps through the techniques for applying FORTRAN loop level 
parallelism to an existing application. Each program is unique; these techniques 
must be adapted for your particular needs. 

In summary, the steps to follow are these: 

1. Mate the original code work on one processor. 

2. Profile the code to find the time-critical part(s). 

3. Perform data dependence analysis on the part(s) found in the previous step. 

4. If necessary, rewrite the code to make it parallelizable. Add c$doacross 
statements as appropriate. 

5. Debug the rewritten code on a single processor. 

6. Run the parallel version on a multiprocessor. Verify that the answers are 
conect 

7. If the answers are wrong, debug the parallel code. Always return to step 5 
(single-process debugging) whenever any change is made to the code. 

8. Profile the parallel version to gauge the effects of the parallelism. 

9. Iterate on these steps until satisfied. 
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6.5.1 First Pass 

The next several pages lead you through the process outlined above. The 
exercise is based on a model of a molecular dynamics program; the routine 
shown below will not work except as a testbed for the debug exercise. 

Step 1 : Make the Original Work 

Make sure the original code runs on a Silicon Graphics workstation before 
attempting to multiprocess it. Multiprocess debugging is much harder than 
single-process debugging, so fix as much as possible in the single-process 
version. 

Step 2: Profile 

Profiling the code enables you to focus your effort on the important parts. For 
example, initialization code is frequently full of loops that will parallelize; 
usually these set arrays to zero. This code typically uses only 1 percent of the 
CPU cycles, so working to parallelize it is pointless. 

In the example, you get the following output when you run the program with 
pixie. For brevity, we omit listing the procedures that took less than 1 percent of 
the total time. 

prof -pixie -quit 1% orig orig.Addrs orig. Counts 



( 



* -p [rocedures] using basic-block counts; sorted in * 

* descending order by the number of cycles executed in* 

* each procedure; unexecuted procedures are excluded * 



10864760 cycles 






cycles %cycles 


cum % 


cycles 
/cal. 


10176621 93.67 


93.67 


484601 


282980 2.60 


96.27 


14149 


115743 1.07 


97.34 


137 



bytes procedure (file) 
/line 

24 calc__ (/tmp/ctmpa00845) 
58 move__ (/tmp/ctmpa0Q837) 
70 t_j?utc (lio.c) 



( 
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The majority of time is spent in the calc routine, which looks like this: 

subroutine calc (num__atoms, atoms, for ce, threshold, weight) 

implicit none 

integer max__atoms 

parameter (max_atoms = 1000) 

integer num__atoms 

double precision atoms (max_atoms , 3 ) , force (max_atoms, 3) 

double precision threshold 

double precision weight (max_atoms) 

double precision dist_sq(3), total_dist_sq 
double precision threshold__sq 
integer i, j 

threshold___sq = threshold ** 2 

do i = 1 , num_atoms 

do j = 1, i-1 

dist_sq(l) = (atoms (i,l); j - atoms (j f l)) ** 2 
dist__sq(2) = (atoms(i,2) - atoms(j,2)) ** 2 
dist_sq(3) = (atoms(i f 3) - atoms(j f 3)) ** 2 

total_dist_sq = dist_sq(l) + dist_sq(2) + dist_sq(3) 

if (total__dist_sq .le. threshold_sq) then 

c 

c Add the force of the nearby atom acting on this 

c atom . . . 

c 

force (i,l) = force (i,l) + weight (i) 
force (i, 2) - force (i, 2) + weight (i) 
force (i, 3) = force (i f 3) + weight (i) 

c 

c ... and the force of this atom acting on the 

c nearby atom 

c 

force (j,l) = force (j,l) + weight (j) 
force (j, 2) = force (j, 2) + weight (j) 
force (j, 3) = force (j, 3) + weight (j) 





end 


if 




end 


do 


end 


do 




return 




end 
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Step 3: Analyze 

It is better to parallelize the outer loop if possible, in order to enclose the most 
work. To do this, analyze the variable usage. The simplest and best way is to 
use the Silicon Graphics POWER FORTRAN Accelerator. If you do not have 
access to this tool, each variable must be examined by hand. 

Data dependence occurs when the same location is written to and read. 
Therefore, any variables not modified inside the loop can be dismissed. Since 
they are read-only, they can be made share variables and do not prevent 
parallelization. In the example, num_atoms f atoms, threshold_sq, and 
weight are only read, so they can be declared share. 

Next, i and j can be local variables. Perhaps not so easily seen is that 
dist_sq can also be a local variable. Even though it is an array, the values 
stored in it do not carry from one iteration to the next; it is simply a vector of 
temporaries. 

The variable force is the crux of the problem. The iterations of force (i , * ) 
are all right. Since each iteration of the outer loop gets a different value of i , 
each iteration uses a different force ( i , * ) . If this was the only use of force, 
we could make force a share variable. However, force ( j , *) prevents this. 
In each iteration of the inner loop, something may be added to both force (i , l ) 
and force ( j , l ) . There is no certainty that i and j will never be the same, so 
you cannot directly parallelize the outer loop. The uses of force look similar to 
sum reductions, but are not quite the same. A likely fix is to use something like 
the sum reduction techniques. 

In analyzing this, notice that the inner loop runs from 1 up to i-l. Therefore, 
j is always less than i, and so the various references to force do not overlap 
with iterations of the inner loop. Thus the various force ( j , * ) references would 
not cause a problem if you were parallelizing the inner loop. 

Further, the force ( i , * ) references are simply sum reductions with respect to 
the inner loop (see Section 5.4, Example 4, for information on modifying this 
loop with a reduction transformation). It appears you can parallelize the inner 
loop. This is a valuable fallback position should you be unable to parallelize the 
outer loop. 



( 



( 



c 
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But the idea is still to parallelize the outer loop. Perhaps sum reductions might 
do the trick. However, remember roundoff error: accumulating partial sums 
gives different answers from the original, because of the limited precision nature 
of computer arithmetic. Depending on your requirements, sum reduction may 
not be the answer. The problem seems to center around force , so try pulling 
those statements entirely out of the loop. 

Step 4: Rewrite 

Rewrite the loop as follows; changes are noted in all caps. 

subroutine calc (num_atoms, atoms, force, threshold, weight) 

implicit none 

integer max__atoms 

parameter (max_atoms - 1000) 

integer num_atoms 

double precision atoms (max_atoms , 3 ) , f orce (max_atoms , 3 ) 

double precision threshold, weight (max_atoms) 

LOGICAL FLAGS (MAX_ATOMS , MAX_ATOMS ) 

double precision dist__sq(3), total_dist_sq 
double precision threshold_sq 
integer i, j 

threshold_sq = threshold ** 2 
C$DOACROSS LOCAL (I, J, DIST_SQ, TOTAL_DIST_SQ) 
do i = 1, num__atoms 

do j = 1, i-1 

dist_sq(l) = (atoms (i, 1) - atoms (j,l)) ** 2 
dist__sq(2) = (atoms (i, 2) - atoms (j, 2)) ** 2 
dist_sq(3) = (atoms(i,3) - atoms(j,3)) ** 2 

total_dist_sq=dist_sq(l)+dist_sq(2)+ dist_sq(3) 
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c 

C SET A FLAG IF THE DISTANCE IS WITHIN THE 

C THRESHOLD 

C 

if (total__dist__sq .le. thresholding) then 

FLAGS (i,j) - .TRUE. 
ELSE 

FLAGS (i,j) - .FALSE, 
end if 

end do 
end do 

DO I - 1, NUM_ATOMS 
DO J - 1, 1-1 
IF (FLAGS (I , J)) THEN 
c 

c Add the force of the nearby atom acting on this 

c atom . . . 

c 

force (i,l) = force (i,l) + weight (i) 
force (i, 2) = force (i r 2) + weight (i) 
force (i, 3) «- force (i, 3) + weight (i) 
c 

c ... and the force of this atom acting on the 

c nearby atom 

c 

force (j,l) = force (j,l) + weight (j) 
force (j, 2) = force (j, 2) + weight (j) 
force (j, 3) » force (j, 3) + weight (j) 
END IF 
END DO 
END DO 

return 
end 

You have parallelized the distance calculations, leaving the summations to be 
done serially. Because you did not alter the order of the summations, this should 
produce exactly the same answer as the original version. 



( 



( 



( 
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Step 5: Debug on a Single Processor 

The temptation might be strong to rush the rewritten code directly to the 
multiprocessor at this point. Remember, single-process debugging is easier than 
multiprocess debugging. Spend time now to compile the code without the -mp 
flag and correct it, to save time later. 

A few iterations should get it right. 

Step 6: Run the Parallel Version 

Compile the code with the -mp flag. As a further check, do the first run with 
the environment variable mp_set_numthreads set to 1. When this works, set 
mp_set_numthreads to 2, and run the job multiprocessed. 

Step 7: Debug the Parallel Version 

If you get the correct output from the version with one thread but not from the 
version with multiple threads, you need to debug the program while running 
multiprocessed. Refer to Section 6.3.1, "General Debugging Hints," for help. 

Step 8: Profile the Parallel Version 

After the parallel job executes correctly, check whether the run time has 
improved. First, compare an execution profile of the modified code compiled 
without -mp to the original profile. This is important because in rewriting the 
code for parallelism, you may have introduced new work. In this example, 
writing and reading the flags array, plus the overhead of the two new do loops, 
is significant The pixie output on the modified code shows the difference: 
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prof -piaci© -quit 1% tryl tryl.Addrs tryl. Counts 



* -procedures} using basic-block counts; sorted in * 

* descending order by the number of cycles executed in * 

* each procedure; unexecuted procedures are excluded * 



( 



13302554 cycles 

cycles %cycles cum % 



12479754 


93.81 


93. 


.81 


594274 


282980 


2.13 


95, 


.94 


14149 


155721 


1.17 


97, 


.11 


43 



cycles bytes procedure (file) 
/call /line 



25 calc_ (/tmp/ctmpa00857) 
58 move_ (/tmp/ctmpa00837) 
29 flsbuf (flsbuf.c) 



The single-processor execution time has increased by about 30 percent 

Look at an execution profile of the master thread in a parallel run and compare it 
to these single-process profiles: 

prof -pixie -quit 1% tryl.mp tryl .mp.Addrs tryl .mp. Counts 00 421 



* -p[rocedures] using basic-block counts; sorted in 

* descending order by the number of cycles executed in 

* each procedure; unexecuted procedures are excluded 



( 



12735722 cycles 

cycles %cycles cum % 



6903896 54.21 54.21 

3034166 23.82 78.03 
(mp_simple_sched. s) 

1812468 14.23 92.26 
(/tmp/fMPcalc_) 

294820 2.31 94.57 

282980 2.22 96.79 



cycles bytes procedure (file) 

/call /line 

328767 37 calc__ (/tmp/ctmpa008 69) 
137 917 16 mp__wa it master 

86308 19 _calc_l_ 

294820 13 mp_create (mp_utils.c) 
1414 9 58 move__ (/tmp/ctmpa00837) 



( 
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Multiprocessing has helped only very little as compared to the single-process 
run of the modified code: the program is running slower than the original. What 
happened? The cycle counts tell the story. The routine calc__ is what remains 
of the original routine after the c$doacross loop _calc_i_ is extracted. caic_ 
still takes nearly 70 percent of the time of the original. When you pulled the 
code for force into a separate loop, you had to remove too much from the loop. 
The serial part is still too large. 

Additionally, there seems to be a load balancing problem. The master is 
spending a large fraction of its time waiting for die slave to complete. But even 
if the load were perfectly balanced, there would still remain the 30 percent 
additional work that the multiprocessed version does. Trying to fix the load 
balancing right now will not solve the general problem. 

6.5.2 Regroup and Attack Again 

Now is the time to try a different approach. If the first attempt does not give 
precisely the desired result, regroup and attack from a new direction. 

Repeat Step 3: Analyze 

At this point, roundoff errors may not be so terrible. Perhaps you can try to adapt 
the sum reduction technique to the original code. 

Although the calculations on force are not quite the same as a sum reduction, 
you can use the same technique: give the reduction variable one extra dimension 
so that each thread gets its own separate memory location. 
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Repeat Step 4: Rewrite 

As before, changes are noted in all caps: 

subroutine calc (num_atoms, atoms , force, threshold, weight) m 

implicit none V 

integer max__atoms 

parameter (max__atoms * 1000) 

integer num_atoms 

double precision atoms (max_atoms , 3 ) , force (max_atoms , 3) 

double precision threshold 

double precision weight (max__atoms) 

double precision dist__sq(3) 

double precision threshold___sq 

integer i, j 

INTEGER MP_SET_NUMTHREADS, MPJSIUMTHREADS 

INTEGER BLOCK_SIZE, THREAD___INDEX 

EXTERNAL MP_NUMTHREADS 

DOUBLE PRECISION PARTIAL (MAX_ATOMS, 3, 4) 

threshold__sq ■■ threshold ** 2 

MP__SET_NUMTHREADS = MPJSIUMTHREADS ( ) 
C 

C INITIALIZE THE PARTIAL SUMS 
C 
C$DOACROSS LOCAL(THREAD__INDEX,I,J) 

DO THREAD__INDEX = 1, MP_JSET_NUMTHREADS 
DO I = 1, NUM__ATOMS 
DO J - 1, 3 
PARTIAL (I, J,THREAD__INDEX) - 0.0D0 
END DO 
END DO 
END DO 

BLOCK_SIZE - (NUM_ATOMS + (MP_SET_NUMTHREADS-1) ) / 
& MP__SET__NUMTHREADS 

C$DOACROSS LOCAL (THREAD__INDEX, I, J, DISTJSQ, TOTALED I ST_SQ) 
DO THREAD__INDEX - 1, MP_SET_NUMTHREADS 

DO I - THREAD_INDEX*BL0CKJSIZE - BLOCK_SIZE + 1, 
$ MIN (THREAD INDEX*BLOCK SIZE, NUM ATOMS) 



( 



( 
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do j = 1, i-1 

dist_sql = (atoms (i,l) - atoms (j f l)) ** 2 
dist_sq2 = (atoms (i, 2) - atoms (j, 2)) ** 2 
dist_sq3 = (atoms (i, 3) - atoms (j, 3)) ** 2 

total_dist_sq = dist_sql + dist___sq2 + dist_sq3 

if (total_dist__sq .le. threshold_sq) then 



c 

c Add the force of the nearby atom acting on this 

c atom . . . 

c 

PARTIAL (i,l,THREADj:NDEX) = PARTIAL (i,l, 
+ THREAD_INDEX) + weight (i) 

PARTIAL (i, 2, THREAD_INDEX) - PARTIAL (i, 2 , 
+ THREADJENDEX) + weight (i) 

PARTIAL (i, 3, THREAD_INDEX) = PARTIAL (i, 3, 
+ THREAD_INDEX) + weight (i) 
c 

c ... and the force of this atom acting on the 

c nearby atom 

c 

PARTIAL (j,l,THREAD_INDEX) - PARTIAL (j, 1,THREAD_INDEX) 
+ + weight ( j) 

PARTIAL ( j , 2 , THREAD_INDEX) - PARTIAL ( j , 2 , THREAD_INDEX) 
+ + weight ( j) 

PARTIAL (j, 3, THREAD_INDEX) = PARTIAL (j , 3, THREAD_INDEX) 
+ + weight ( j) 





end 


if 




end 


do 


end 


do 
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c 

C TOTAL UP THE PARTIAL SUMS 
C 

DO I - 1, NUM_ATOMS 

DO THREAD__INDEX - 1, MPJSETJJUMTHREADS 
FORCE (1,1) - FORCE (1,1) + PARTIAL (I, 1, THREADJENDEX) 
FORCE (I, 2) - FORCE (I, 2) + PARTIAL (I, 2, THREADJENDEX) 
FORCE (I f 3) - FORCE (I f 3) + PARTIAL (I, 3, THREAD_INDEX) 
END DO 
END DO 

return 
end 

Repeat Step 5: Debug on a Single Processor 

Because you are doing sum reductions in parallel, the answers may not exactly 
match the original. Be careful to distinguish between real errors and variations 
introduced by roundoff. In this example, the answers agreed with the original 
for 10 digits. 

Repeat Step 6: Run the Parallel Version 

Again, because of roundoff, the answers produced vary slightly depending on 
the number of processors used to execute the program. This variation must be 
distinguished from any actual error. 

Repeat Step 7: Profile the Parallel Version 

The output from ihc pixie run for this routine looks like this: 

prof -pixie -quit 1% try2.mp try2 .mp. Addrs try2 .mp. Counts 00 42 3 



* -procedures] using basic-block counts; sorted in * 

* descending order by the number of cycles executed in * 

* each procedure; unexecuted procedures are excluded * 



10036679 cycles 

cycles %cycles cum % cycles bytes procedure (file) 

/call /line 

139908 16 mpjwaitmaster 

144223 31 _calc_2__ (/tmp/fMPcalc__) 

14149 58 move_ (/tmp/ctmpa00837) 
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6016033 


59.94 


59. 


,94 


(mp simple 


sched. s) 






3028682 


30.18 


90. 


.12 


282980 


2.82 


92. 


.94 



( 



( 



( 



194040 1.93 94.87 9240 41 calc_ (/tmp/ctmpa00881) 
115743 1.15 96.02 137 70 t_putc (lio.c) 

With this rewrite, calc_ now accounts for only a small part of the total. You 
have pushed most of the work into the parallel region. Since you added a 
multiprocessed initialization loop before the main loop, that new loop is now 
named _calc_i_ and the main loop is now _calc_2_ . The initialization took 
less than 1 percent of the total time, and so does not even appear on the listing. 

The large number for the routine mp_waitmaster indicates a problem. Look at 
the pixie run for the slave process: 

\f8prof -pixie -quit 1% try2.mp try2.mp.Addrs 
try2.mp.Counts00424 \f7 



-p[rocedures] using basic-block counts; sorted in 
descending order by the number of cycles executed in 
each procedure; unexecuted procedures are excluded 



10704474 cycles 

cycles %cycles cum % cycles bytes procedure (file) 

/call /line 



7701642 71.95 71.95 366745 
(/tmp/fMPcalc_) 

2909559 27.18 99.13 67665 
(mp_slave.s) 



31 __calc_2_ 

32 mp_slave___wait__for_work 



The slave is spending more than twice as many cycles in the main 
multiprocessed loop as the master. This is a severe load balancing problem. 

Repeat Step 3 Again: Analyze 

Examine the loop again. Since the inner loop goes from 1 to i-l , the first few 
iterations of the outer loop have far less work in them than the last iterations. 
Try breaking the loop into interleaved pieces rather than contiguous pieces. 
Also, since the partial array should have the leftmost index vary the fastest, 
flip the order of the dimensions. For fun, we will put some loop unrolling in the 
initialization loop. This is a marginal optimization since the initialization loop is 
less than 1 percent of the total execution time. 

Repeat Step 4 Again: Rewrite 

The new version looks like this, with changes capitalized. 
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subroutine calc (num__atoms , atoms , force , threshold , weight ) 

implicit none 

integer max__atoms 

parameter (max___atoms - 1000) 

integer num_atoms f 

double precision atoms (max__atoms f 3 ) , force (max_atoms, 3) i 

double precision threshold 

double precision weight (max__atoms) 

double precision dist__sq(3), total_dist__sq 
double precision threshold__sq 

integer i, j 

integer MP_SET_NUMTHREADS , mp__numthreads , thread_index 

external mp__numthreads 

double precision partial (3, max_atoms, 4) 

threshold__sq - threshold ** 2 

MP_SET_NUMTHREADS = mp_numthreads () 

c 

c Initialize the partial sums 

c 

c$doacross local (thread__index, i, j) 

do thread_index - 1, MP_SET___NUMTHREADS 
do i - 1, num__atoms 
PARTIAL (1, I, THREAD_XNDEX) - O.OdO 
PARTIAL (2 f I f THREAD__INDEX) = O.OdO 
PARTIAL (3, I,THREAD_INDEX) - O.OdO 
end do 
end do 

c$doacross local (thread__index, i, j, dist_sq, total_dist_sq) 



( 



( 
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do thread_index = 1, MP_SET_NUMTHREADS 

DO I - THREAD_INDEX, NUM_ATOMS, MP_SET_NUMTHREADS 

do j - 1, i-1 

dist_sql - (atoms (i,l) - atoms (j,l)) ** 2 
dist_sq2 = (atoms (i, 2) - atoms (j, 2)) ** 2 
dist_sq3 = (atoms (i, 3) - atoms (j, 3)) ** 2 

total_dist_sq *■ dist__sql + dist_sq2 + dist_sq3 

if (total__dist_sq .le. threshold_sq) then 
c 

c Add the force of the nearby atom acting on this 

c atom . . . 

c 

partial (1, i,thread_index) = partial (1, i, thread_index) 
+ + weight (i) 

partial (2, i, thread_index) = partial (2, i, thread_index) 
+ + weight (i) 

partial (3, i, thread_index) = partial (3, i, thread_index) 
+ + weight (i) 

c 
c 
c 

partial (1, j , thread_index) = partial (l,j, thread_index) 
+ + weight ( j) 

partial (2 , j , thread_index) = partial (2 , j , thread_index) 
+ + weight ( j) 

partial (3 , j , thread__index) = partial (3 , j , thread_index) 
+ + weight ( j) 



end 


if 


end 


do 


end do 




ENDDO 
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c Total up the partial sums 
c 

DO THREAD__INDEX - 1, MP_SETJHUMTHREADS 
DO I - 1, NUMJSlTOMS 

force (i,l) = force (i,l) + partial (1, i,thread__index) 
force (i, 2) = force (i, 2) + partial (2, i,thread_index) 
force (i, 3) = force (i, 3) + partial (3, i,thread_index) 
end do 
end do 

return 
end 

With these final fixes in place, repeat the same steps to check out the changes: 

1 . Debug on a single processor. 

2. Run the parallel version. 

3. Debug the parallel version. 

4. Profile the parallel version. 
Repeat Step 7 Again: Profile 

The pixie output for the latest version of the code looks like this: 

\f8prof -pixie -quit 1% try3.mp try3 .mp.Addrs 
try3.mp.Counts00425\f7 



( 



( 



-p[rocedures] using basic-block counts; sorted in 
descending order by the number of cycles executed in 
each procedure; unexecuted procedures are excluded 



( 
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7045818 cycles 



cycles 


%cycles 


cum % 


cycles 
/call 


5960816 


84.60 


84.60 


283849 


282980 


4.02 


88.62 


14149 


179893 


2.75 


91.37 


4184 


(mp simple 


sched. s) 






159978 


2.55 


93.92 


7618 


115743 


1.64 


95.56 


137 



bytes procedure (file) 
/line 

31 _calc_2_ (/tmp/fMPcalcJ 
58 move_ (/tmp/ctmpa00837) 
16 mp_waitmaster 

41 calc__ (/tmp/ctmpa00941) 
70 t_jputc (lio.c) 



This looks quite good. To be sure you have solved the load balancing problem, 
check that the slave output shows a roughly equal amount of time spent in 
_calc_2_. Once this is verified, you are finally finished. 

Epilogue 

After considerable effort, you reduced execution time by about 30 percent by 
using two processors. Since the routine you multiprocessed still accounts for the 
majority of work, even with two processors, you would expect considerable 
improvements by moving this code to a four-processor machine. Since the code 
is parallelized, no further conversion is needed for the more powerful machine; 
you can just transport the executable image and run it. 

Note that you have added a noticeable amount of work to get the 
multiprocessing correct; the run time for a single processor has degraded nearly 
30 per cent. This is a big number, and it may be worthwhile to keep two 
versions of the code around: the version optimized for multiple processors and 
the version optimized for single processors. Frequently the performance 
degradation on a single processor is not nearly so large, and is not worth the 
bother of keeping multiple versions around. You can simply run the 
multiprocessed version on a single processor. The only way to know what to 
keep is to run the code and time it. 
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Appendix A: Runtime Error 

Messages 



This appendix lists possible FORTRAN runtime I/O errors. Other errors given 
by the operating system may also occur. 

Each error is listed on the screen either alone or with this phrase appended to it: 

apparent state: unit <num> named <user file name> 

last format: <string> 

lately (reading, writing) (sequential, direct, indexed) 

(formatted, unformatted) (external, internal) 10 

When the FORTRAN runtime system detects an error, the following action takes 
place: 

A message describing the error is written to the standard error unit 
(UnitO). 

• A core file, which can be used with dhx (the debugger) to inspect the 
state of the program at termination, is produced if an environment 
variable "f77-dump-flag" is defined and set to "y". 

When a runtime error occurs, the program terminates with one of the error 
messages shown in the following table. All of the errors in the table are output 
in the format <user file name>: <message>. 
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Message mmmmssss ^^ 

end of file -1 bad string: error in format 

error in format 

illegal unit number 

formatted io not allowed 

unformatted io not allowed 

direct io not allowed 

sequential io not allowed 

can't backspace file 

null file name 

can't stat file 

unit not connected 

read unexpected character 

blank logical input field 

bad variable type 

bad namelist name 

variable not in namelist 

no end record 

variable count incorrect 

negative repeat count 

illegal operation for unit 

off beginning of record 

no * after repeat count 

'new' file exists 

can't find 'old' file 

unknown system error 

requires seek ability 

illegal argument 

duplicate key value on write 

indexed file not open 

bad isam argument 

bad key description 

too many open indexed files 



( 



( 



( 



Table A-1. Runtime Error Messages 
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Message ___»____^ 

corrupted isam file 

isam file not opened for exclusive access 

record locked 

key already exists 

cannot delete primary key 

beginning or end of file reached 

cannot find request record 

current record not defined 

isam file is exclusively locked 

filename too long 

cannot create lock file 

record too long 

key structure does not match file structure 

direct access on an indexed file not allowed 

keyed access on a Ml sequential file not allowed 

keyed access on a relative file not allowed 

must specify record length 

key field value type does not match key type 

character key field value length too long 

fixed record on Ml sequential file not allowed 

variable records allowed only on unformatted Ml sequential file 

stream records allowed only on formatted Ml sequential file 

maximum number of records in direct access file exceeded 

attempt to write to a readonly file 

must specify key descriptions 

carriage control not allowed for unformatted units 

indexed files only unknown type in lio:illegal error number 204 

illegal operation on indexed file 

illegal operation on indexed or append file 

Table A-1 (continued). Runtime Error Messages 
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Additional errors: 



format too complicated: <string> 

unknown code in dojio: <num> <string> 

repetition: incomprehensible list input | 



no star: incomprehensible list input 
no comma:incomprehensible list input 
no ): incomprehensible list input 
logical: incomprehensible list input 
no quote: incomprehensible list input 
no space: out of free space 
rd_ned, unexpected code: <num> <string> 
e__ed, unexpected code: <num> <string> 



Table A-2. More Runtime Error Messages 



( 



( 
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allgn8 compiler option, 1-9, 2-3, 

2-6 
alignl 6 compiler option, 1-9, 2-3, 

2-6 
alignment, 1-9, 2-2, 2-5 
archiver, 1-18 

arguments, passing, 3-4, 3-20 
arrays, 2-4 

C,3-6 

character, 3-14 

Pascal, 3-22 
automatic compiler option, 1-13, 

6-2 



-C compiler option, 1-9, 6-7 
C functions, 3-2 
c$, 5-8 
c$&, 5-8 
c$doacross, 5-3, 5-35 

nesting, 5-9 
c$mp__schedtype, 5-9 
cache, 5-24 
character arrays, 3-14 
character variables, 3-1 
chunk, 5-5, 5-27 
common blocks, 2-4, 3-8, 3-24, 

5-3, 5-32, 6-7 
compilation, 1-3 
compiler options, 1-3, 1-9 



data dependencies, 5-1 1 

breaking, 5-17 

complicated, 5-15 

inconsequential, 5-16 

rewritable, 5-14 
data types 

alignment, 1-9, 2-2, 2-3, 2-5 

C,3-4 

FORTRAN, 3-4, 3-20 

Pascal, 3-20 



size, 2-2 

value ranges, 2-2, 2-3 
date, 4-8 

debugging, 1-14, , 6-6, 6-22 
direct files, 1-19 
do loops, 5-1,5-11 
DOACROSS, 5-9 
driver options, 1-8 
drivers, 1-2 
dynamic scheduling, 5-5 



environment variables, 5-31 , 5-32, 

6-3 
equivalence statements, 6-7, 6-10 
error handling, 1-21 
errsns, 4-9 
executable object, 1-4 
exit, 4-9 
external files, 1-19 



f77 driver, 1-2 
file formats, 1-19 
files, 1-19 

direct, 1-19 

external, 1-19 

positions when opened, 1-20 

preconnected, 1-20 

sequential unformatted, 1-19 

unknown status, 1-20 
formats 

files, 1-19 
functions 

declarations 
C.3-3 
Pascal, 3-1 8 

in parallel loops, 5-14 

intrinsic, 4-11, , 5-14 
ran, 4-1 1 
secnds, 4-1 1 

library, 4-1, 5-14 

side effects, 5-14 
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-g compiler option, 1-14, 6-10 
graphics library, 1-4, 1-7 
guided self-scheduling, 5-5 



I 



idate, 4-8 

IF clause, 5-4 

interleave scheduling, 5-4, 5-27 

intrinsic functions, 4-1 1 

intrinsic subroutines, 4-7 



-I compiler option, 1-7 
LASTLOCAL, 5-3, 5-11 
libraries 

graphics, 1-4, 1-7 

link, 1-7 

specifying, 1-8 
library functions, 4-1 
link libraries, 1-7 
linking, 1-6 

-Im compiler option, 1-8 
LOCAL, 5-3, 5-1 1 
loop interchange, 5-22 
loops, 5-1 

data dependencies, 5-1 1 

tranformation, 5-54 
-Ip compiler option, 1-8 



M 

makefiles, 3-1 6 

master processes, 5-2, 5-37 

misaligned data, 2-5 

mkf2c, 3-9 

-mp compiler option, 5-37, 6-2 

mp_block, 5-29 

MP_SCHEDTYPE, 5-4, 5-9 

mp_unblock, 5-29 

multi-language programs, 1-5 

mvbits, 4-1 



object files, 1-4, 1-17 
object module, 1-4 
objects 

linking, 1-6 
-onetrip compiler option, 1-12 
optimizing, 1-15 



Pascal, 3-17 

Pascal main routine, 3-18 

passing arguments, 3-4, 3-5, 3-20 

parameters 

reduction of, 3-1 1 
PFA, 5-12, ,6-19 
power FORTRAN accelerator, 5- 

12,6-19 
preconnected files, 1-20 
processes 

master, 5-2, 5-37 

slave, 5-2, 5-37 
profiling, 1-14,6-5,6-17 
programs 

multi-language, 1-5 



ran, 4-1 1 
rand, 5-14 
records, 1-19 
REDUCTION, 5-4, 5-19 
-Rflags compiler option, 1-13 
run-time errors, 1-21 
runtime scheduling, 5-5 



c 



( 



( 
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secnds, 4-1 1 
self-scheduling, 5-5 
sequential unformatted, 1-19 
sequential unformatted files, 1-19 
scheduling methods, 5-4, 5-27, 

5-^35 

dynamic, 5-5 

guided self-scheduling, 5-5 

interleave, 5-4 

runtime, 5-5 

simple, 5-4 
SHARE, 5-3, 5-1 1 
SIGCLD, 5-30, 5-35 
simple scheduling, 5-4 
size of data types, 2-2 
slave processes, 5-2, 5-37 
source files, 1-3 
spooled routines, 5-35 
sproc, 5-34 
static compiler option, 1-13, 5-14, 

6-2, 6-7 
subprograms, 3-2 
subroutines 

intrinsic, 4-7, 5-14 

system, 4-7 
date, 4-8 
errsns, 4-9 
exit, 4-9 
idate, 4-8 
mvbits, 4-1 
time, 4-10 
system interface, 4-1 
system subroutines, 4-7 



value ranges, 2-2, 2-3 
variables 

in parallel loops, 5-1 1 

local, 5-13 
-vms_cc compiler option, 1 -1 3 
-vms_endfile compiler option, 1-13 
-vms__stdin compiler option, 1-13 

w 

-w66 compiler option, 1-13 
-wf compiler option, 2-6 
work quantum, 5-22 
wrapper generators, 3-9 



-Xlocaldata loader directive, 5-33 



T 

time, 4-10 

u 

-U compiler option, 1-13 
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